    TNN7: A Custom Macro Suite for Implementing Highly Optimized Designs of Neuromorphic TNNs. (arXiv:2205.07410v2 [cs.AR] UPDATED)
    Temporal Neural Networks (TNNs), inspired by the mammalian neocortex, exhibit energy-efficient online sensory processing capabilities. Recent works have proposed a microarchitecture framework for implementing TNNs and demonstrated competitive performance on vision and time-series applications. Building on these previous works, this work proposes TNN7, a suite of nine highly optimized custom macros developed using a predictive 7nm Process Design Kit (PDK), to enhance the efficiency, modularity, and flexibility of the TNN design framework. TNN prototypes for two applications are used to evaluate TNN7. An unsupervised time-series clustering TNN delivering competitive performance can be implemented within 40 uW of power and 0.05 mm^2 of area, while a 4-layer TNN that achieves an MNIST error rate of 1% consumes only 18 mW and 24.63 mm^2. On average, the proposed macros reduce power, delay, area, and energy-delay product by 14%, 16%, 28%, and 45%, respectively. Furthermore, employing TNN7 reduces the synthesis runtime of TNN designs by more than 3x, allowing highly-scaled TNN implementations to be realized.
    Adaptive Fairness-Aware Online Meta-Learning for Changing Environments. (arXiv:2205.11264v2 [cs.LG] UPDATED)
    The fairness-aware online learning framework has arisen as a powerful tool for the continual lifelong learning setting. The goal of the learner is to sequentially learn new tasks as they arrive one after another over time, while ensuring statistical parity of each newly arriving task across different protected sub-populations (e.g., race and gender). A major drawback of existing methods is that they make heavy use of the i.i.d. assumption for the data and hence provide static regret analyses for the framework. However, low static regret does not imply good performance in changing environments where tasks are sampled from heterogeneous distributions. To address the fairness-aware online learning problem in changing environments, in this paper we first construct a novel regret metric, FairSAR, by adding long-term fairness constraints onto a strongly adapted loss regret. Furthermore, to determine a good model parameter at each round, we propose a novel adaptive fairness-aware online meta-learning algorithm, namely FairSAOML, which is able to adapt to changing environments in both bias control and model precision. The problem is formulated as a bi-level convex-concave optimization with respect to the model's primal and dual parameters, which are associated with the model's accuracy and fairness, respectively. The theoretical analysis provides sub-linear upper bounds for both the loss regret and the violation of cumulative fairness constraints. Our experimental evaluation on different real-world datasets with changing-environment settings suggests that the proposed FairSAOML significantly outperforms alternatives based on the best prior online learning approaches.
    CNNs are Myopic. (arXiv:2205.10760v2 [cs.CV] UPDATED)
    We claim that Convolutional Neural Networks (CNNs) learn to classify images using only small, seemingly unrecognizable tiles. We show experimentally that CNNs trained only on such tiles can match or even surpass the performance of CNNs trained on full images. Conversely, CNNs trained on full images show similar predictions on small tiles. We also propose the first a priori theoretical model for convolutional data sets that seems to explain this behavior. This gives additional support to the long-standing suspicion that CNNs do not need to understand the global structure of images to achieve state-of-the-art accuracies. Surprisingly, it also suggests that over-fitting is not needed either.
    Representation learning with function call graph transformations for malware open set recognition. (arXiv:2205.06918v2 [cs.CR] UPDATED)
    The open set recognition (OSR) problem has been a challenge in many machine learning (ML) applications, such as security. As new/unknown malware families emerge regularly, it is difficult to exhaust samples that cover all classes for the training process of ML systems. An advanced malware classification system should classify the known classes correctly while remaining sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for function call graph (FCG) based malware representations to facilitate the pretext task. We also present a statistical thresholding approach to find the optimal threshold for the unknown class. Moreover, the experimental results indicate that our proposed pre-training process improves the performance of different downstream loss functions on the OSR problem.
    Nonparametric likelihood-free inference with Jensen-Shannon divergence for simulator-based models with categorical output. (arXiv:2205.10890v2 [stat.ME] UPDATED)
    Likelihood-free inference for simulator-based statistical models has recently attracted a surge of interest, both in the machine learning and statistics communities. The primary focus of these research fields has been to approximate the posterior distribution of model parameters, either by various types of Monte Carlo sampling algorithms or by deep neural network-based surrogate models. Frequentist inference for simulator-based models has been given much less attention to date, even though it would be particularly amenable to applications with big data, where the implicit asymptotic approximation of the likelihood is expected to be accurate and can leverage computationally efficient strategies. Here we derive a set of theoretical results to enable estimation, hypothesis testing, and construction of confidence intervals for model parameters using asymptotic properties of the Jensen-Shannon divergence. Such asymptotic approximation offers a rapid alternative to more computation-intensive approaches and can be attractive for diverse applications of simulator-based models.
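    For reference, the Jensen-Shannon divergence whose asymptotic properties the paper exploits has the standard symmetrized form (a textbook definition, not a result of the paper):

        \mathrm{JS}(P \,\|\, Q) \;=\; \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) \;+\; \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)
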
    Domain Adversarial Graph Convolutional Network Based on RSSI and Crowdsensing for Indoor Localization. (arXiv:2204.05184v2 [cs.NI] UPDATED)
    In recent years, due to wider WiFi coverage and the popularization of mobile communication devices, indoor positioning technology using WiFi fingerprints has developed rapidly. Currently, most supervised methods need to collect a large amount of data to construct fingerprint datasets, which is labor-intensive and time-consuming. In addition, many studies have focused on ideal laboratory environments and lack consideration of practical application environments, especially scenarios with multiple large multi-floor buildings. To solve these problems, we propose a novel WiDAGCN model that can be trained with a small amount of labeled site-survey data and unlabeled crowdsensed WiFi fingerprints. To comprehensively represent the topological structure of the data, we construct heterogeneous graphs according to the received signal strength indicators (RSSIs) between the waypoints and WiFi access points (APs). Moreover, previous WiFi indoor localization studies rarely involved complete graph feature representation, so we use a graph convolutional network (GCN) to extract graph-level embeddings. Some difficult problems remain, for example, a large amount of unlabeled data that cannot be used by a supervised model, and multiple data domains that lead to inconsistency in data distribution. Therefore, a semi-supervised domain adversarial training scheme is used to make full use of the unlabeled data and align the data distributions of different domains. A public indoor localization dataset containing different buildings is used to evaluate the performance of the model. The experimental results show that our system achieves competitive localization accuracy in large buildings such as shopping malls.
    A Multi-Stage Duplex Fusion ConvNet for Aerial Scene Classification. (arXiv:2203.16325v2 [cs.CV] UPDATED)
    Existing deep learning based methods effectively boost the performance of aerial scene classification. However, due to their large number of parameters and high computational cost, it is difficult to apply these methods to multiple real-time remote sensing applications such as on-board data perception on drones and satellites. In this paper, we address this task by developing a lightweight ConvNet named the multi-stage duplex fusion network (MSDF-Net). The key idea is to use as few parameters as possible while obtaining the strongest possible scene representation capability. To this end, a residual-dense duplex fusion strategy is developed to enhance feature propagation while re-using parameters as much as possible, and is realized by our duplex fusion block (DFblock). Specifically, our MSDF-Net consists of multi-stage structures with DFblocks. Moreover, a duplex semantic aggregation (DSA) module is developed to mine remote sensing scene information from the extracted convolutional features; it also contains two parallel branches for semantic description. Extensive experiments are conducted on three widely-used aerial scene classification benchmarks and show that our MSDF-Net achieves competitive performance against the recent state-of-the-art while reducing the number of parameters by up to 80%. In particular, an accuracy of 92.96% is achieved on AID with only 0.49M parameters.
    Near-Optimal Sparse Allreduce for Distributed Deep Learning. (arXiv:2201.07598v2 [cs.DC] UPDATED)
    Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique for reducing the communication volume. However, it is very challenging to obtain real performance improvements because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
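    To make the threshold-based selection concrete, the sketch below shows the general pattern of picking roughly the top-k gradient entries against a cached threshold instead of performing an exact sort every iteration. It is a minimal illustration of the idea, not the paper's implementation; the error-feedback residual and the refresh policy are assumptions.

        import numpy as np

        def sparsify_topk(grad, k, threshold=None):
            # Exact top-k pass (O(n log n)); done only when no cached threshold exists.
            if threshold is None:
                threshold = np.partition(np.abs(grad), -k)[-k]
            # Cheap O(n) selection against the cached threshold on later iterations.
            mask = np.abs(grad) >= threshold
            values, indices = grad[mask], np.nonzero(mask)[0]
            # Error feedback: accumulate what was dropped, to be added back next step.
            residual = np.where(mask, 0.0, grad)
            return values, indices, residual, threshold
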
    ChemicalX: A Deep Learning Library for Drug Pair Scoring. (arXiv:2202.05240v3 [cs.LG] UPDATED)
    In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed for providing a range of state of the art models to solve the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework.The design of ChemicalX reuses existing high level model training utilities, geometric deep learning, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies to highlight the characteristics of ChemicalX. A range of experiments on real world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX could be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware.
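    For readers unfamiliar with the task, the sketch below shows the generic shape of a drug pair scoring model: encode each drug's feature vector, then score the pair with a small head. This is a self-contained PyTorch illustration of the task setup, not the ChemicalX API; all names are made up.

        import torch
        from torch import nn

        class PairScorer(nn.Module):
            """Generic drug pair scoring head (illustrative; not ChemicalX)."""

            def __init__(self, in_dim: int, hidden: int = 128):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
                self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                          nn.Linear(hidden, 1))

            def forward(self, drug_a, drug_b):
                # Encode both drugs, concatenate, and score the pair.
                h = torch.cat([self.encoder(drug_a), self.encoder(drug_b)], dim=-1)
                return torch.sigmoid(self.head(h))  # probability of interaction/synergy

        scores = PairScorer(in_dim=256)(torch.randn(32, 256), torch.randn(32, 256))
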
    Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations. (arXiv:2204.04773v2 [stat.ML] UPDATED)
    Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on the assumption that the contexts are fully observed, while the study of the richer model of imperfectly observed contextual bandits remains immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analyses showcasing the above efficiency of Greedy policies are also provided.
    Lorentzian Fully Hyperbolic Generative Adversarial Network. (arXiv:2201.12825v2 [cs.LG] UPDATED)
    With the recent advances in deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information in data. While a variety of hyperbolic neural network structures have been proposed, they mainly focus on discriminative tasks, and generative models in hyperbolic space have scarcely been studied. In this work, we propose a hyperbolic generative adversarial network (GAN) within the Lorentz model for generating hyperbolic data. In addition to existing hyperbolic operations, we design novel hyperbolic layers to guarantee stable training. We first use synthetic data to show that our network is able to learn a simple distribution in hyperbolic space. Moreover, by virtue of an autoencoder, we construct a neural network model, named HAEGAN, for generating more complex data in hyperbolic space. HAEGAN contains three parts: first, a hyperbolic autoencoder; second, a hyperbolic GAN for generating the latent embedding of the autoencoder; third, a generator that inherits the decoder from the autoencoder and the generator from the GAN. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Incremental Inference on Higher-Order Probabilistic Graphical Models Applied to Constraint Satisfaction Problems. (arXiv:2202.12916v2 [cs.LG] UPDATED)
    Probabilistic graphical models (PGMs) are tools for modeling complex probabilistic relationships. However, suboptimal PGM structures are primarily used in practice. This dissertation presents three contributions to the PGM literature. The first is a comparison between factor graphs and cluster graphs on graph colouring problems such as Sudokus, indicating a significant advantage for preferring cluster graphs. The second is an application of cluster graphs to a practical problem in cartography: land cover classification boosting. The third is a PGM formulation of constraint satisfaction problems and an algorithm called purge-and-merge that solves problems too complex for traditional PGM approaches.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v2 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients. (arXiv:2201.01601v2 [cs.LG] UPDATED)
    Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training, which is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, a conventional FL training protocol that treats all data equally leads to a waste of local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting the privacy and computational capabilities of clients. To better utilize sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round as client training data vary. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.20~4.48x while improving the model accuracy by 1.1~5.0%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that it improves convergence speed and accuracy when operating jointly with three different FL algorithms.
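    As a rough picture of what "actively selecting training samples" can look like on a client, the sketch below keeps the highest-loss (most informative) fraction of local samples plus a small random slice of the rest. The ratios and the loss-as-informativeness proxy are illustrative assumptions, not FedBalancer's exact procedure.

        import numpy as np

        def select_samples(losses, keep_ratio=0.5, explore_ratio=0.05, seed=0):
            # Rank local samples by per-sample loss: higher loss ~ more informative.
            rng = np.random.default_rng(seed)
            order = np.argsort(losses)[::-1]
            n_keep = int(len(losses) * keep_ratio)
            selected = list(order[:n_keep])
            # Mix in a few low-loss samples so easy/clean data is not starved.
            rest = order[n_keep:]
            n_explore = min(int(len(losses) * explore_ratio), len(rest))
            selected += list(rng.choice(rest, size=n_explore, replace=False))
            return np.array(selected)
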
    Domain-informed neural networks for interaction localization within astroparticle experiments. (arXiv:2112.07995v2 [hep-ex] UPDATED)
    This work proposes a domain-informed neural network architecture for experimental particle physics, using particle interaction localization with the time-projection chamber (TPC) technology for dark matter research as an example application. A key feature of the signals generated within the TPC is that they allow localization of particle interactions through a process called reconstruction. While multilayer perceptrons (MLPs) have emerged as a leading contender for reconstruction in TPCs, such a black-box approach does not reflect prior knowledge of the underlying scientific processes. This paper looks anew at neural network-based interaction localization and encodes prior detector knowledge, in terms of both signal characteristics and detector geometry, into the feature encoding and the output layers of a multilayer neural network. The resulting Domain-informed Neural Network (DiNN) limits the receptive fields of the neurons in the initial feature encoding layers in order to account for the spatially localized nature of the signals produced within the TPC. This aspect of the DiNN, which has similarities with the emerging area of graph neural networks in that the neurons in the initial layers only connect to a handful of neurons in their succeeding layer, significantly reduces the number of parameters in the network in comparison to an MLP. In addition, to account for the detector geometry, the output layers of the network are modified using two geometric transformations to ensure the DiNN produces localizations within the interior of the detector. The end result is a neural network architecture that has 60% fewer parameters than an MLP, but that still achieves similar localization performance and provides a path to future architectural developments with improved performance because of its ability to encode additional domain knowledge into the architecture.
    Nonlinear Transform Source-Channel Coding for Semantic Communications. (arXiv:2112.10961v2 [cs.IT] UPDATED)
    In this paper, we propose a class of high-efficiency deep joint source-channel coding methods that can closely adapt to the source distribution under a nonlinear transform; we collect these methods under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into a latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive-rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across test image sources with various resolutions, we find that the proposed NTSCC transmission method generally outperforms both analog transmission using standard deep joint source-channel coding and classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its content-aware ability and perceptual optimization goal.
    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks. (arXiv:2108.11489v2 [stat.ML] UPDATED)
    The recent success of neural network models has shed light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization and the properties of the data covariance matrix in achieving low excess risk.
    Classification of COVID-19 on chest X-Ray images using Deep Learning model with Histogram Equalization and Lungs Segmentation. (arXiv:2112.02478v2 [eess.IV] UPDATED)
    Background and Objective: Artificial intelligence (AI) methods coupled with biomedical analysis have a critical role during pandemics, as they help to release the overwhelming pressure from healthcare systems and physicians. As the ongoing COVID-19 crisis worsens in countries with dense populations and inadequate testing kits, such as Brazil and India, radiological imaging can act as an important diagnostic tool to accurately classify COVID-19 patients and prescribe the necessary treatment in due time. With this motivation, we present our study based on a deep learning architecture for detecting COVID-19 infected lungs using chest X-rays. Dataset: We collected a total of 2470 images for three different class labels, namely healthy lungs, ordinary pneumonia, and COVID-19 infected pneumonia, out of which 470 X-ray images belong to the COVID-19 category. Methods: We first pre-process all the images using histogram equalization techniques and segment them using a U-net architecture. A VGG-16 network is then used for feature extraction from the pre-processed images, and the features are further resampled using the SMOTE oversampling technique to achieve a balanced dataset. Finally, the class-balanced features are classified using a support vector machine (SVM) classifier with 10-fold cross-validation, and the accuracy is evaluated. Result and Conclusion: Our novel approach, combining well-known pre-processing techniques, feature extraction methods, and a dataset balancing method, leads to an outstanding recognition rate of 98% for COVID-19 images over a dataset of 2470 X-ray images. Our model is therefore fit to be utilized in healthcare facilities for screening purposes.
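    The final classification stage described above maps directly onto standard tooling; a minimal sketch is shown below, assuming `features` holds the VGG-16 embeddings of the equalized, segmented X-rays and `labels` the three classes (both file names are placeholders). Note that, as in the abstract, SMOTE is applied before the 10-fold split.

        import numpy as np
        from imblearn.over_sampling import SMOTE
        from sklearn.svm import SVC
        from sklearn.model_selection import StratifiedKFold, cross_val_score

        features = np.load("vgg16_features.npy")   # (n_images, n_features), placeholder
        labels = np.load("labels.npy")             # 0=healthy, 1=pneumonia, 2=COVID-19

        # Balance the three classes with SMOTE, then evaluate an RBF-kernel SVM
        # with 10-fold cross-validation.
        X_bal, y_bal = SMOTE(random_state=0).fit_resample(features, labels)
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        scores = cross_val_score(SVC(kernel="rbf"), X_bal, y_bal, cv=cv)
        print(f"10-fold accuracy: {scores.mean():.3f}")
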
    Evaluating Generalization in Classical and Quantum Generative Models. (arXiv:2201.08770v2 [cs.LG] UPDATED)
    Defining and accurately measuring generalization in generative models remains an ongoing challenge and a topic of active research within the machine learning community. This is in contrast to discriminative models, where there is a clear definition of generalization, i.e., the model's classification accuracy when faced with unseen data. In this work, we construct a simple and unambiguous approach to evaluate the generalization capabilities of generative models. Using the sample-based generalization metrics proposed here, any generative model, from state-of-the-art classical generative models such as GANs to quantum models such as Quantum Circuit Born Machines, can be evaluated on the same ground within a concrete, well-defined framework. In contrast to other sample-based metrics for probing generalization, we leverage constrained optimization problems (e.g., cardinality-constrained problems) and use these discrete datasets to define specific metrics capable of unambiguously measuring the quality of the samples and the model's generalization capabilities for generating data beyond the training set but still within the valid solution space. Additionally, our metrics can diagnose trainability issues such as mode collapse and overfitting, as we illustrate when comparing GANs to quantum-inspired models built out of tensor networks. Our simulation results show that our quantum-inspired models have up to a $68 \times$ enhancement in generating unseen unique and valid samples compared to GANs, and a ratio of 61:2 for generating samples with better quality than those observed in the training set. We foresee these metrics as valuable tools for rigorously defining practical quantum advantage in the domain of generative modeling.
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v3 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model to satisfy the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES), which selects the point gaining the most information about the underlying function, as estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.
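    A minimal sketch of the sample-and-weight step follows, assuming draws from an unconstrained GP posterior are available as an array of function values; the soft-rejection weighting below is one plausible reading of the idea, not the paper's exact scheme.

        import numpy as np

        def weight_posterior_samples(samples, lower, upper, slack=0.0):
            # samples: (n_samples, n_points) function draws from a GP posterior.
            # Penalize each draw by its total violation of the known bounds.
            over = np.maximum(samples - (upper + slack), 0.0)
            under = np.maximum((lower - slack) - samples, 0.0)
            violation = (over + under).sum(axis=1)
            weights = np.exp(-violation)        # soft rejection of bound violators
            return weights / weights.sum()
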
    Improved Fine-Tuning by Better Leveraging Pre-Training Data. (arXiv:2111.12292v2 [cs.CV] UPDATED)
    As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that, in some vision tasks, training from scratch achieves final performance no worse than this pre-training strategy once the number of training samples is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis, using the excess risk bound which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. This observation inspires us to leverage the pre-training data during fine-tuning, since this data is often also available at that stage. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when appropriate pre-training data is included in fine-tuning. With this theoretical motivation, we propose a novel selection strategy to select a subset of the pre-training data to help improve generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data-selection-based fine-tuning pipeline.
    DeepTrack: Lightweight Deep Learning for Vehicle Path Prediction in Highways. (arXiv:2108.00505v2 [cs.LG] UPDATED)
    Vehicle trajectory prediction is essential for enabling safety-critical intelligent transportation systems (ITS) applications used in management and operations. While there have been some promising advances in the field, there is a need for modern deep learning algorithms that allow real-time trajectory prediction on embedded IoT devices. This article presents DeepTrack, a novel deep learning algorithm customized for real-time vehicle trajectory prediction and monitoring applications in arterial management, freeway management, traffic incident management, and work zone management for high-speed incoming traffic. In contrast to previous methods, the vehicle dynamics are encoded using Temporal Convolutional Networks (TCNs) to provide more robust time prediction with less computation. DeepTrack also uses depthwise convolution, which reduces the complexity of models compared to existing approaches in terms of model size and operations. Overall, our experimental results demonstrate that DeepTrack achieves comparable accuracy to state-of-the-art trajectory prediction models but with smaller model sizes and lower computational complexity, making it more suitable for real-world deployment.
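    The parameter saving from depthwise convolution is easy to see in code; the block below is a minimal causal depthwise-separable temporal convolution in PyTorch, with channel counts and kernel size chosen for illustration rather than taken from DeepTrack.

        import torch
        from torch import nn

        class DepthwiseTemporalBlock(nn.Module):
            """Causal depthwise-separable 1D convolution (illustrative sizes)."""

            def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
                super().__init__()
                self.pad = (kernel_size - 1) * dilation
                # groups=channels makes the convolution depthwise: one filter per channel.
                self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                           padding=self.pad, dilation=dilation,
                                           groups=channels)
                self.pointwise = nn.Conv1d(channels, channels, 1)  # mixes channels
                self.act = nn.ReLU()

            def forward(self, x):                            # x: (batch, channels, time)
                out = self.depthwise(x)[..., :x.size(-1)]    # trim to keep causality
                return self.act(self.pointwise(out))

        y = DepthwiseTemporalBlock(channels=16)(torch.randn(8, 16, 30))
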
    Trustworthy AI: From Principles to Practices. (arXiv:2110.01167v2 [cs.AI] UPDATED)
    The rapid development of Artificial Intelligence (AI) technology has enabled the deployment of various systems based on it. However, many current AI systems are found to be vulnerable to imperceptible attacks, biased against underrepresented groups, and lacking in user privacy protection. These shortcomings degrade user experience and erode people's trust in all AI systems. In this review, we provide AI practitioners with a comprehensive guide for building trustworthy AI systems. We first introduce the theoretical framework of important aspects of AI trustworthiness, including robustness, generalization, explainability, transparency, reproducibility, fairness, privacy preservation, and accountability. To unify currently available but fragmented approaches toward trustworthy AI, we organize them in a systematic approach that considers the entire lifecycle of AI systems, ranging from data acquisition to model development, to system development and deployment, and finally to continuous monitoring and governance. In this framework, we offer concrete action items for practitioners and societal stakeholders (e.g., researchers, engineers, and regulators) to improve AI trustworthiness. Finally, we identify key opportunities and challenges for the future development of trustworthy AI systems, where we identify the need for a paradigm shift toward comprehensively trustworthy AI systems.
    FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization. (arXiv:2111.14655v2 [cs.LG] UPDATED)
    One underlying assumption of recent federated learning (FL) paradigms is that all local models share the same network architecture and size, which becomes impractical for devices with different hardware resources. A scalable federated learning framework should address the heterogeneity of clients with different computing capacities and communication capabilities. To this end, this paper proposes FedHM, a novel heterogeneous federated model compression framework that distributes heterogeneous low-rank models to clients and then aggregates them into a full-rank model. Our solution enables the training of heterogeneous models with varying computational complexities and aggregates them into a single global model. Furthermore, FedHM significantly reduces the communication cost by using low-rank models. Extensive experimental results demonstrate that FedHM is superior in the performance and robustness of models of different sizes, compared with state-of-the-art heterogeneous FL methods under various FL settings. Additionally, we provide the first theoretical analysis of the FL convergence guarantee for heterogeneous devices.
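    The low-rank idea can be illustrated in a few lines: split a dense layer's weight matrix into two thin factors via a truncated SVD, so a resource-constrained client trains (and communicates) out*r + r*in parameters instead of out*in. This is a generic sketch of the technique, not FedHM's code.

        import torch
        from torch import nn

        def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
            # Truncated SVD of the weight W (out x in): W ~= (U sqrt(S)) (sqrt(S) Vh).
            U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
            first = nn.Linear(layer.in_features, rank, bias=False)
            second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
            first.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (rank, in)
            second.weight.data = U[:, :rank] * S[:rank].sqrt()             # (out, rank)
            if layer.bias is not None:
                second.bias.data = layer.bias.data.clone()
            return nn.Sequential(first, second)

        low_rank = factorize_linear(nn.Linear(512, 256), rank=32)  # ~5x fewer weights
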
    Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual Observations. (arXiv:2109.14026v2 [cs.RO] UPDATED)
    To proactively navigate and traverse various terrains, active use of visual perception becomes indispensable. We aim to investigate the feasibility and performance of using sparse visual observations to achieve perceptual locomotion over a range of common terrains (steps, ramps, gaps, and stairs) in human-centered environments. We formulate a selection of sparse visual inputs suitable for locomotion over the terrains of interest, and propose a learning framework to integrate exteroceptive and proprioceptive states. We specifically design the state observations and a training curriculum to learn feedback control policies effectively over a range of different terrains. We extensively validate and benchmark the learned policy in various tasks: omnidirectional walking on flat ground, and forward locomotion over various obstacles, showing a high success rate of traversal. Furthermore, we study exteroceptive ablations and evaluate policy generalization by adding various levels of noise and testing on new unseen terrains. We demonstrate the capabilities of autonomous perceptual locomotion that can be achieved using only sparse visual observations from direct depth measurements, which are easily available from a Lidar or RGB-D sensor, showing robust ascent and descent over high stairs of 20 cm height, i.e., 50% of leg length, and robustness against noise and unseen terrains.
    RKHS-SHAP: Shapley Values for Kernel Methods. (arXiv:2110.09167v2 [stat.ML] UPDATED)
    Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values~(SV), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose \textsc{RKHS-SHAP}, an attribution method for kernel machines that can efficiently compute both \emph{Interventional} and \emph{Observational Shapley values} using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for consistent model interpretation. Further, we propose \emph{Shapley regulariser}, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific feature's contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the SVs of sensitive features.
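    For reference, the Shapley value underlying the method assigns feature $i$ its average marginal contribution over all feature coalitions (the standard game-theoretic definition; for feature attribution, $v(S)$ is the expected model output when only the features in $S$ are known, with interventional and observational SVs differing in how the remaining features are handled):

        \phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)
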
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v3 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound on the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are indeed able to satisfy the first two factors. Moreover, we empirically verify the third factor by conducting various experiments on real-world datasets, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
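    As a reminder of the first of these losses, InfoNCE in its standard form pulls an anchor $z_i$ toward its augmented positive $z_i^{+}$ (alignment) while pushing it away from the other samples in the batch (this is the textbook formulation, not a result of the paper):

        \mathcal{L}_{\mathrm{InfoNCE}} \;=\; -\,\mathbb{E}\!\left[\log \frac{\exp\bigl(\mathrm{sim}(z_i, z_i^{+})/\tau\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{sim}(z_i, z_j)/\tau\bigr)}\right]
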
    Amplitude Mean of Functional Data on $\mathbb{S}^2$. (arXiv:2107.13721v4 [stat.ML] UPDATED)
    Manifold-valued functional data analysis (FDA) has recently become an active area of research, motivated by the rising availability of trajectories or longitudinal data observed on non-linear manifolds. The challenges of analyzing such data come from many aspects, including infinite dimensionality and nonlinearity, as well as time-domain or phase variability. In this paper, we study the amplitude part of manifold-valued functions on $\mathbb{S}^2$, which is invariant to random time warping or re-parameterization. Utilizing the nice geometry of $\mathbb{S}^2$, we develop a set of efficient and accurate tools for temporal alignment of functions, geodesic computation, and sample mean calculation. At their heart, these tools rely on gradient descent algorithms with carefully derived gradients. We show the advantages of these newly developed tools over their competitors with extensive simulations and real data, and demonstrate the importance of considering the amplitude part of functions instead of mixing it with phase variability in manifold-valued FDA.
    TIP: Task-Informed Motion Prediction for Intelligent Vehicles. (arXiv:2110.08750v2 [cs.RO] UPDATED)
    When predicting trajectories of road agents, motion predictors usually approximate the future distribution by a limited number of samples. This constraint requires the predictors to generate samples that best support the task given task specifications. However, existing predictors are often optimized and evaluated via task-agnostic measures without accounting for the use of predictions in downstream tasks, and thus could result in sub-optimal task performance. In this paper, we propose a task-informed motion prediction model that better supports the tasks through its predictions, by jointly reasoning about prediction accuracy and the utility of the downstream tasks, which is commonly used to evaluate the task performance. The task utility function does not require the full task information, but rather a specification of the utility of the task, resulting in predictors that serve a wide range of downstream tasks. We demonstrate our approach on two use cases of common decision making tasks and their utility functions, in the context of autonomous driving and parallel autonomy. Experiment results show that our predictor produces accurate predictions that improve the task performance by a large margin in both tasks when compared to task-agnostic baselines on the Waymo Open Motion dataset.
    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts. (arXiv:2111.03394v2 [cs.LG] UPDATED)
    Long-range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts coherent in terms of base-level and predicted aggregate statistics. We achieve coherency between the predicted base-level and aggregate statistics using a novel inference method based on the KL divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both the base level and unseen aggregates post-inference on real datasets spanning three diverse domains. (\href{https://github.com/pratham16cse/AggForecaster}{Project URL})
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v3 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; a statistical learning machine (SLM) is the algorithm, function, model, or rule that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Cyberphysical systems such as automatic target recognition (ATR) in military applications, computer-aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, and stock market prediction are a few examples and applications of ML; diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data, in order to hunt threats, achieve better monitoring, master the complexity of threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. By \textit{construction} we mean designing or inventing an appropriate algorithm that learns from the input data and achieves good performance according to some optimality criterion. By \textit{assessment} we mean attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm.
    MixR: Data Mixing Augmentation for Regression. (arXiv:2106.03374v3 [cs.LG] UPDATED)
    Data augmentation is becoming essential for improving regression accuracy in critical applications including manufacturing, climate prediction, and finance. Existing techniques for data augmentation largely focus on classification tasks and do not readily apply to regression tasks. In particular, the recent Mixup techniques for classification have succeeded in improving model performance, which is reasonable given the characteristics of the classification task, but they have limitations in regression. We show that mixing examples that have large data distances using linear interpolations may have increasingly negative effects on model performance. Hence, we use the stricter assumption that linearity only holds within certain data distances for regression, where the degree may vary by example. We then propose MixR, a data augmentation framework for regression that learns, for each example, how many nearest neighbors it should be mixed with for the best model performance, using a validation set. Our experiments conducted on both synthetic and real datasets show that MixR significantly outperforms state-of-the-art data augmentation baselines applicable to regression. MixR can also be integrated with existing Mixup techniques to significantly improve their performance.
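    The core mixing step can be sketched as follows: each example is interpolated only with one of its k nearest neighbors, where k would be tuned per example on a validation set (here it is simply given). This is an illustrative rendering of the idea, not the authors' implementation.

        import numpy as np

        def mix_with_neighbors(X, y, k_per_example, alpha=0.2, seed=0):
            rng = np.random.default_rng(seed)
            n = len(X)
            # Pairwise distances; neighbor lists exclude the example itself.
            dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
            neighbors = np.argsort(dists, axis=1)[:, 1:]
            lam = rng.beta(alpha, alpha, size=n)
            partners = np.array([rng.choice(neighbors[i, :k_per_example[i]])
                                 for i in range(n)])
            # Linear interpolation of both inputs and regression targets.
            X_mix = lam[:, None] * X + (1 - lam[:, None]) * X[partners]
            y_mix = lam * y + (1 - lam) * y[partners]
            return X_mix, y_mix
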
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v1 [stat.ML])
    A simple model for studying subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension, are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm, analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. In particular, we identify the existence of a statistical-to-computational gap between the algorithms that require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of AMP with other sparsity-enhancing algorithms, such as sparse PCA and diagonal thresholding.
    Verifying Learning-Based Robotic Navigation Systems. (arXiv:2205.13536v1 [cs.RO])
    Deep reinforcement learning (DRL) has become a dominant deep-learning paradigm for various tasks in which complex policies are learned within reactive systems. In parallel, there has recently been significant research on verifying deep neural networks. However, to date, there has been little work demonstrating the use of modern verification tools on real, DRL-controlled systems. In this case-study paper, we attempt to begin bridging this gap, and focus on the important task of mapless robotic navigation -- a classic robotics problem, in which a robot, usually controlled by a DRL agent, needs to efficiently and safely navigate through an unknown arena towards a desired target. We demonstrate how modern verification engines can be used for effective model selection, i.e., the process of selecting the best available policy for the robot in question from a pool of candidate policies. Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior, such as collisions and infinite loops. We also apply verification to identify models with overly conservative behavior, thus allowing users to choose superior policies that are better at finding an optimal, shorter path to a target. To validate our work, we conducted extensive experiments on an actual robot, and confirmed that the suboptimal policies detected by our method were indeed flawed. We also compared our verification-driven approach to state-of-the-art gradient attacks, and our results demonstrate that gradient-based methods are inadequate in this setting. Our work is the first to demonstrate the use of DNN verification backends for recognizing suboptimal DRL policies in real-world robots, and for filtering out unwanted policies. We believe that the methods presented in this work can be applied to a large range of application domains that incorporate deep-learning-based agents.
    Steerable 3D Spherical Neurons. (arXiv:2106.13863v6 [cs.CV] UPDATED)
    Emerging from low-level vision theory, steerable filters found their counterpart in prior work on steerable convolutional neural networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of neurons with spherical decision surfaces and operates on point clouds. Such spherical neurons are obtained by conformal embedding of Euclidean space and have recently been revisited in the context of learning representations of point sets. Focusing on 3D geometry, we exploit the isometry property of spherical neurons and derive a 3D steerability constraint. After training spherical neurons to classify point clouds in a canonical orientation, we use a tetrahedron basis to quadruplicate the neurons and construct rotation-equivariant spherical filter banks. We then apply the derived constraint to interpolate the filter bank outputs and, thus, obtain a rotation-invariant network. Finally, we use a synthetic point set and real-world 3D skeleton data to verify our theoretical findings.
    Ranking the information content of distance measures. (arXiv:2104.15079v2 [stat.ML] UPDATED)
    Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features while still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine whether they are equivalent, independent, or whether one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide-ranging in many branches of science.
    Mitigating barren plateaus of variational quantum eigensolvers. (arXiv:2205.13539v1 [quant-ph])
    Variational quantum algorithms (VQAs) are expected to establish valuable applications on near-term quantum computers. However, recent works have pointed out that the performance of VQAs greatly relies on the capability of the ansatzes and is seriously limited by optimization issues such as barren plateaus (i.e., vanishing gradients). This work proposes the state efficient ansatz (SEA) for accurate quantum dynamics simulations with improved trainability. First, we show that SEA can generate an arbitrary pure state with far fewer parameters than a universal ansatz, making it efficient for tasks like ground state estimation. It also offers flexibility in adjusting the entanglement of the prepared state, which could be applied to further improve the efficiency of simulating weak entanglement. Second, we show that SEA is not a unitary 2-design even if it has universal wavefunction expressibility, and thus has great potential to improve trainability by avoiding the zone of barren plateaus. We further investigate a plethora of examples in ground state estimation and notably obtain significant improvements in the variances of derivatives and the overall optimization behavior. This result indicates that SEA can mitigate barren plateaus by sacrificing expressibility that is redundant for the target problem.
    Epistemic Neural Networks. (arXiv:2107.08924v3 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of \textit{joint} predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the \textit{epistemic neural network} (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the \textit{epinet}: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    TrustyAI Explainability Toolkit. (arXiv:2104.12717v2 [cs.AI] UPDATED)
    Artificial intelligence (AI) is becoming increasingly popular and can be found in workplaces and homes around the world. The decisions made by such "black box" systems are often opaque; that is, so complex as to be functionally impossible to understand. How do we ensure that these systems are behaving as desired? TrustyAI is an initiative which looks into explainable artificial intelligence (XAI) solutions to address this issue of explainability in the context of both AI models and decision services. This paper presents the TrustyAI Explainability Toolkit, a Java and Python library that provides XAI explanations of decision services and predictive models for both enterprise and data science use cases. We describe the TrustyAI implementations and extensions to techniques such as LIME, SHAP and counterfactuals, which are benchmarked against existing implementations in a variety of experiments.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v4 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap, and cross validation (CV). Statistics of special interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge of the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of a ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to traditional ML algorithms rather than the very computationally demanding approaches of recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence the relevance of this chapter to this field.
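    As a concrete instance of the resampling methods discussed, the sketch below computes the leave-one-out bootstrap estimate of a classifier's error rate from a single dataset: train on each bootstrap resample, then test on the points left out of it (the classifier choice is arbitrary, for illustration only).

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def bootstrap_error(X, y, n_boot=200, seed=0):
            rng = np.random.default_rng(seed)
            n, errors = len(y), []
            for _ in range(n_boot):
                idx = rng.integers(0, n, size=n)          # resample with replacement
                oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag test points
                if oob.size == 0:
                    continue
                clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
                errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
            return float(np.mean(errors))
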
    TempoRL: Temporal Priors for Exploration in Off-Policy Reinforcement Learning. (arXiv:2205.13528v1 [cs.LG])
    Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlation between actions and directedness. Therefore, we introduce state-independent temporal priors, which directly model temporal consistency in demonstrated trajectories, and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings.
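    The mixture-based integration can be read as a two-line sampling rule; the sketch below is a minimal rendering with an assumed fixed mixing weight (the paper's scheme samples actions from the mixture dynamically).

        import numpy as np

        def sample_action(policy_sample, prior_sample, w_prior=0.3, rng=np.random):
            # With probability w_prior, act from the state-independent temporal prior
            # (e.g., a continuation of recently demonstrated action sequences);
            # otherwise act from the current off-policy RL policy.
            if rng.random() < w_prior:
                return prior_sample()
            return policy_sample()
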
    Spherical Message Passing for 3D Graph Networks. (arXiv:2102.05013v3 [cs.LG] UPDATED)
    We consider representation learning from 3D graphs in which each node is associated with a spatial position in 3D. This is an underexplored area of research, and a principled framework is currently lacking. In this work, we propose a generic framework, known as the 3D graph network (3DGN), to provide a unified interface at different levels of granularity for 3D graphs. Built on 3DGN, we propose spherical message passing (SMP) as a novel and specific scheme for realizing the 3DGN framework in the spherical coordinate system (SCS). We conduct formal analyses and show that the relative location of each node in a 3D graph is uniquely defined in the SMP scheme. Thus, our SMP represents a complete and accurate architecture for learning from 3D graphs in the SCS. We derive physically-based representations of geometric information and propose the SphereNet for learning representations of 3D graphs. We show that existing 3D deep models can be viewed as special cases of the SphereNet. Experimental results demonstrate that the use of complete and accurate 3D information in 3DGN and SphereNet leads to significant performance improvements in prediction tasks.
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v1 [cs.LG])
    Selective classification is the task of rejecting inputs on which a model would predict incorrectly, trading off input-space coverage against model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely by studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks. For example, we improve coverage on CIFAR-10/SVHN by 10.1%/1.5%, respectively, at a fixed target error of 0.5%.
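    A minimal sketch of the core mechanism described above, assuming predictions from intermediate checkpoints have already been recorded (the array name and the coverage-based threshold are illustrative, not the paper's exact scoring rule):

        import numpy as np

        def last_disagreement(checkpoint_preds):
            """checkpoint_preds: (T, N) predicted labels across T checkpoints.
            Returns, per point, the index of the last checkpoint whose prediction
            disagrees with the final prediction (-1 if none ever disagrees)."""
            final = checkpoint_preds[-1]
            disagree = checkpoint_preds != final           # (T, N) boolean
            T = checkpoint_preds.shape[0]
            rev_first = disagree[::-1].argmax(axis=0)      # first disagreement counted from the end
            return np.where(disagree.any(axis=0), T - 1 - rev_first, -1)

        def accept_mask(checkpoint_preds, coverage=0.9):
            scores = last_disagreement(checkpoint_preds)   # lower = settled earlier = safer
            return scores <= np.quantile(scores, coverage) # accept the most stable points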
    Revealing the Dark Secrets of Masked Image Modeling. (arXiv:2205.13543v1 [cs.CV])
    Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, the visualizations and the experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. That may be the reason why MIM helps Vision Transformers, which have a very large receptive field, to optimize. Using MIM, the model can maintain a large diversity of attention heads in all layers, but for supervised models the diversity of attention heads almost disappears in the last three layers, and this loss of diversity harms fine-tuning performance. From the experiments, we find that MIM models can perform significantly better on geometric and motion tasks with weak semantics or fine-grained classification tasks, than their supervised counterparts. Without bells and whistles, a standard MIM pre-trained SwinV2-L could achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For the semantic understanding datasets where the categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.
    Memory AMP. (arXiv:2012.10861v6 [cs.IT] UPDATED)
    Approximate message passing (AMP) is a low-cost iterative parameter-estimation technique for certain high-dimensional linear systems with non-Gaussian distributions. However, AMP only applies to independent identically distributed (IID) transform matrices and may become unreliable (e.g., perform poorly or even diverge) for other matrix ensembles, especially for ill-conditioned ones. Orthogonal/vector AMP (OAMP/VAMP) was proposed for general right-unitarily-invariant matrices to handle this difficulty. However, the Bayes-optimal OAMP/VAMP (BO-OAMP/VAMP) requires a high-complexity linear minimum mean square error (MMSE) estimator. This limits the application of OAMP/VAMP to large-scale systems. To overcome the disadvantages of AMP and BO-OAMP/VAMP, this paper proposes a memory AMP (MAMP) framework under an orthogonality principle, which guarantees the asymptotic IID Gaussianity of estimation errors in MAMP. We present an orthogonalization procedure for the local memory estimators to realize the required orthogonality for MAMP. Furthermore, we propose a Bayes-optimal MAMP (BO-MAMP), in which a long-memory matched filter is proposed for interference suppression. The complexity of BO-MAMP is comparable to AMP. A state evolution is derived to asymptotically characterize the performance of BO-MAMP. Based on state evolution, the relaxation parameters and damping vector in BO-MAMP are optimized. For all right-unitarily-invariant matrices, the state evolution of the optimized BO-MAMP converges to the same fixed point as that of the high-complexity BO-OAMP/VAMP and is Bayes-optimal if its state evolution has a unique fixed point. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Semantic Parsing of Interpage Relations. (arXiv:2205.13530v1 [cs.LG])
    Page-level analysis of documents has been a topic of interest in digitization efforts, and multimodal approaches have been applied to both classification and page stream segmentation. In this work, we focus on capturing finer semantic relations between pages of a multi-page document. To this end, we formalize the task as semantic parsing of interpage relations and we propose an end-to-end approach for interpage dependency extraction, inspired by the dependency parsing literature. We further design a multi-task training approach to jointly optimize for page embeddings to be used in segmentation, classification, and parsing of the page dependencies using textual and visual features extracted from the pages. Moreover, we also combine the features from two modalities to obtain multimodal page embeddings. To the best of our knowledge, this is the first study to extract rich semantic interpage relations from multi-page documents. Our experimental results show that, over a naive baseline, the proposed method increased LAS by 41 percentage points for semantic parsing, accuracy by 33 percentage points for page stream segmentation, and accuracy by 45 percentage points for page classification.
    Transfer learning driven design optimization for inertial confinement fusion. (arXiv:2205.13519v1 [physics.plasm-ph])
    Transfer learning is a promising approach to creating predictive models that incorporate simulation and experimental data into a common framework. In this technique, a neural network is first trained on a large database of simulations, then partially retrained on sparse sets of experimental data to adjust predictions to be more consistent with reality. Previously, this technique has been used to create predictive models of Omega and NIF inertial confinement fusion (ICF) experiments that are more accurate than simulations alone. In this work, we conduct a transfer learning driven hypothetical ICF campaign in which the goal is to maximize experimental neutron yield via Bayesian optimization. The transfer learning model achieves yields within 5% of the maximum achievable yield in a modest-sized design space in fewer than 20 experiments. Furthermore, we demonstrate that this method is more efficient at optimizing designs than traditional model calibration techniques commonly employed in ICF design. Such an approach to ICF design could enable robust optimization of experimental performance under uncertainty.
    Sparse Graph Learning for Spatiotemporal Time Series. (arXiv:2205.13492v1 [cs.LG])
    Outstanding achievements of graph neural networks for spatiotemporal time series prediction show that relational constraints introduce a positive inductive bias into neural forecasting architectures. Often, however, the relational information characterizing the underlying data generating process is unavailable; the practitioner is then left with the problem of inferring from data which relational graph to use in the subsequent processing stages. We propose novel, principled -- yet practical -- probabilistic methods that learn the relational dependencies by modeling distributions over graphs while maximizing, at the same time, end-to-end the forecasting accuracy. Our novel graph learning approach, based on consolidated variance reduction techniques for Monte Carlo score-based gradient estimation, is theoretically grounded and effective. We show that tailoring the gradient estimators to the graph learning problem also allows us to achieve state-of-the-art forecasting performance while controlling, at the same time, both the sparsity of the learned graph and the computational burden. We empirically assess the effectiveness of the proposed method on synthetic and real-world benchmarks, showing that the proposed solution can be used as a stand-alone graph identification procedure as well as a learned component of an end-to-end forecasting architecture.
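    The following hedged sketch illustrates the general score-function (REINFORCE) gradient estimator with a moving-average baseline for learning Bernoulli edge probabilities end-to-end; the forecasting model is a stand-in, and the paper's specific variance-reduction scheme is not reproduced:

        import torch

        n = 8
        theta = torch.zeros(n, n, requires_grad=True)      # edge logits

        def forecast_loss(adj):                            # stand-in for the forecasting model
            return ((adj.sum(-1) - 2.0) ** 2).mean()

        opt = torch.optim.Adam([theta], lr=0.05)
        baseline = 0.0
        for step in range(200):
            probs = torch.sigmoid(theta)
            adj = torch.bernoulli(probs.detach())          # sample a graph
            loss = forecast_loss(adj)
            # score-function (REINFORCE) estimator with a moving-average baseline
            log_p = (adj * torch.log(probs + 1e-9)
                     + (1 - adj) * torch.log(1 - probs + 1e-9)).sum()
            surrogate = (loss.item() - baseline) * log_p
            opt.zero_grad(); surrogate.backward(); opt.step()
            baseline = 0.9 * baseline + 0.1 * loss.item()  # variance reduction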
    AutoTSG: Learning and Synthesis for Incident Troubleshooting. (arXiv:2205.13457v1 [cs.SE])
    Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of over 4K+ TSGs mapped to 1000s of incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automating TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
    Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling. (arXiv:2205.13508v1 [cs.LG])
    Domain Adaptation (DA) has received widespread attention from deep learning researchers in recent years because of its potential to improve test accuracy with out-of-distribution labeled data. Most state-of-the-art DA algorithms require an extensive amount of hyperparameter tuning and are computationally intensive due to the large batch sizes required. In this work, we propose a fast and simple DA method consisting of three stages: (1) domain alignment by covariance matching, (2) pseudo-labeling, and (3) ensembling. We call this method $\textbf{PACE}$, for $\textbf{P}$seudo-labels, $\textbf{A}$lignment of $\textbf{C}$ovariances, and $\textbf{E}$nsembles. PACE is trained on top of fixed features extracted from an ensemble of modern pretrained backbones. PACE exceeds previous state-of-the-art by $\textbf{5 - 10 \%}$ on most benchmark adaptation tasks without training a neural network. PACE reduces training time and hyperparameter tuning time by $82\%$ and $97\%$, respectively, when compared to state-of-the-art DA methods. Code is released here: https://github.com/Chris210634/PACE-Domain-Adaptation
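    A rough sketch of stages (1) and (2) under stated assumptions: CORAL-style whitening/recoloring is one standard way to realize the covariance-matching step, followed by confidence-thresholded pseudo-labeling (the classifier and threshold are illustrative, not the authors' exact choices):

        import numpy as np
        from scipy.linalg import sqrtm
        from sklearn.linear_model import LogisticRegression

        def coral_align(Xs, Xt, eps=1e-5):
            """Whiten source features, then recolor them with the target covariance."""
            Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
            Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
            W = np.real(sqrtm(np.linalg.inv(Cs)) @ sqrtm(Ct))
            return (Xs - Xs.mean(0)) @ W + Xt.mean(0)

        def pseudo_label(Xs, ys, Xt, conf=0.9):
            clf = LogisticRegression(max_iter=1000).fit(coral_align(Xs, Xt), ys)
            proba = clf.predict_proba(Xt)
            keep = proba.max(axis=1) >= conf         # keep only confident target points
            return Xt[keep], proba[keep].argmax(axis=1)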
    Training ReLU networks to high uniform accuracy is intractable. (arXiv:2205.13531v1 [cs.LG])
    Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
    Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations. (arXiv:2205.13479v1 [cs.LG])
    Modeling multivariate time series as temporal signals over a (possibly dynamic) graph is an effective representational framework that allows for developing models for time series analysis. In fact, discrete sequences of graphs can be processed by autoregressive graph neural networks to recursively learn representations at each discrete point in time and space. Spatiotemporal graphs are often highly sparse, with time series characterized by multiple, concurrent, and even long sequences of missing data, e.g., due to the unreliable underlying sensor network. In this context, autoregressive models can be brittle and exhibit unstable learning dynamics. The objective of this paper is, then, to tackle the problem of learning effective models to reconstruct, i.e., impute, missing data points by conditioning the reconstruction only on the available observations. In particular, we propose a novel class of attention-based architectures that, given a set of highly sparse discrete observations, learn a representation for points in time and space by exploiting a spatiotemporal diffusion architecture aligned with the imputation task. Representations are trained end-to-end to reconstruct observations w.r.t. the corresponding sensor and its neighboring nodes. Compared to the state of the art, our model handles sparse data without propagating prediction errors or requiring a bidirectional model to encode forward and backward time dependencies. Empirical results on representative benchmarks show the effectiveness of the proposed method.
    Censored Quantile Regression Neural Networks. (arXiv:2205.13496v1 [stat.ML])
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable 'self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
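    As a sketch of the simultaneous-grid idea (uncensored case only; the paper's censoring adjustments and self-correcting algorithm are omitted), a single network can output one value per quantile and be trained with the pinball loss:

        import torch
        import torch.nn as nn

        taus = torch.linspace(0.05, 0.95, 19)        # grid of target quantiles

        net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, len(taus)))

        def pinball(pred, y):
            """pred: (B, Q) quantile outputs, y: (B, 1) targets."""
            diff = y - pred
            return torch.maximum(taus * diff, (taus - 1) * diff).mean()

        x = torch.randn(256, 4)
        y = x[:, :1] + 0.3 * torch.randn(256, 1)     # toy data
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        for _ in range(500):
            loss = pinball(net(x), y)
            opt.zero_grad(); loss.backward(); opt.step()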
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v1 [q-bio.NC])
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    FedAug: Reducing the Local Learning Bias Improves Federated Learning on Heterogeneous Data. (arXiv:2205.13462v1 [cs.LG])
    Federated Learning (FL) is a machine learning paradigm that learns from data kept locally to safeguard the privacy of clients, whereas local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedAug, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedAug consists of two components: AugMean and AugCA. AugMean alleviates the bias in the local classifiers by balancing the output distribution of models. AugCA learns client invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedAug consistently outperforms other SOTA FL and domain generalization (DG) baselines, in which both components (i.e., AugMean and AugCA) yield individual performance gains.
    SigMaNet: One Laplacian to Rule Them All. (arXiv:2205.13459v1 [cs.LG])
    This paper introduces SigMaNet, a generalized Graph Convolutional Network (GCN) capable of handling both undirected and directed graphs with weights not restricted in sign and magnitude. The cornerstone of SigMaNet is the introduction of a generalized Laplacian matrix: the Sign-Magnetic Laplacian ($L^\sigma$). The adoption of such a matrix allows us to bridge a gap in the current literature by extending the theory of spectral GCNs to directed graphs with both positive and negative weights. $L^{\sigma}$ exhibits several desirable properties not enjoyed by the traditional Laplacian matrices on which several state-of-the-art architectures are based. In particular, $L^\sigma$ is completely parameter-free, which is not the case for Laplacian operators such as the Magnetic Laplacian $L^{(q)}$, where the calibration of the parameter q is an essential yet problematic component of the operator. $L^\sigma$ simplifies the approach, while also allowing for a natural interpretation of the signs of the edges in terms of their directions. The versatility of the proposed approach is amply demonstrated experimentally; the proposed network SigMaNet turns out to be competitive in all the tasks we considered, regardless of the graph structure.
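    For context, a sketch of the standard Magnetic Laplacian $L^{(q)}$ mentioned above, whose parameter q is exactly what the parameter-free $L^\sigma$ is designed to eliminate (this follows the usual unnormalized definition; normalization is omitted):

        import numpy as np

        def magnetic_laplacian(A, q=0.25):
            """A: non-negative, possibly asymmetric adjacency matrix."""
            As = (A + A.T) / 2                           # symmetrized edge weights
            Ds = np.diag(As.sum(axis=1))
            phase = np.exp(2j * np.pi * q * (A - A.T))   # edge direction encoded as a phase
            return Ds - As * phase                       # complex Hermitian Laplacian L^(q)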
    SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation. (arXiv:2205.13490v1 [cs.CV])
    Conventional point cloud semantic segmentation methods usually employ an encoder-decoder architecture, where mid-level features are locally aggregated to extract geometric information. However, the over-reliance on these class-agnostic local geometric representations may raise confusion between local parts from different categories that are similar in appearance or spatially adjacent. To address this issue, we argue that mid-level features can be further enhanced with semantic information, and propose semantic-affine transformation that transforms features of mid-level points belonging to different categories with class-specific affine parameters. Based on this technique, we propose SemAffiNet for point cloud semantic segmentation, which utilizes the attention mechanism in the Transformer module to implicitly and explicitly capture global structural knowledge within local parts for overall comprehension of each category. We conduct extensive experiments on the ScanNetV2 and NYUv2 datasets, and evaluate semantic-affine transformation on various 3D point cloud and 2D image segmentation baselines, where both qualitative and quantitative results demonstrate the superiority and generalization ability of our proposed approach. Code is available at https://github.com/wangzy22/SemAffiNet.
    An Analytic Framework for Robust Training of Artificial Neural Networks. (arXiv:2205.13502v1 [cs.LG])
    The reliability of a learning model is key to the successful deployment of machine learning in various industries. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. Consequently, many studies investigate the phenomenon by proposing a simplified model of how adversarial examples occur and validate it by predicting some aspect of the phenomenon. While these studies cover many different characteristics of the adversarial examples, they have not reached a holistic approach to the geometric and analytic modeling of the phenomenon. This paper proposes a formal framework to study the phenomenon in learning theory and makes use of complex analysis and holomorphicity to offer a robust learning rule for artificial neural networks. With the help of complex analysis, we can effortlessly move between geometric and analytic perspectives of the phenomenon and offer further insights on the phenomenon by revealing its connection with harmonic functions. Using our model, we can explain some of the most intriguing characteristics of adversarial examples, including transferability of adversarial examples, and pave the way for novel approaches to mitigate the effects of the phenomenon.
    Mutual Information Divergence: A Unified Metric for Multimodal Generative Models. (arXiv:2205.13445v1 [cs.CV])
    Text-to-image generation and image captioning have recently emerged as a new experimental paradigm to assess machine intelligence. They predict continuous quantities accompanied by sampling techniques during generation, making evaluation complicated and the marginal distributions intractable to obtain. Based on a recent trend that multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, coined Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.
    DeepJoint: Robust Survival Modelling Under Clinical Presence Shift. (arXiv:2205.13481v1 [cs.LG])
    Observational data in medicine arise as a result of the complex interaction between patients and the healthcare system. The sampling process is often highly irregular and itself constitutes an informative process. When using such data to develop prediction models, this phenomenon is often ignored, leading to sub-optimal performance and generalisability of models when practices evolve. We propose a multi-task recurrent neural network which models three clinical presence dimensions -- namely the longitudinal, the inter-observation and the missingness processes -- in parallel to the survival outcome. On a prediction task using MIMIC III laboratory tests, explicit modelling of these three processes showed improved performance in comparison to state-of-the-art predictive models (C-index at 1 day horizon: 0.878). More importantly, the proposed approach was more robust to change in the clinical presence setting, demonstrated by performance comparison between patients admitted on weekdays and weekends. This analysis demonstrates the importance of studying and leveraging clinical presence to improve performance and create more transportable clinical models.
    Machine Learning Models Are Not Necessarily Biased When Constructed Properly: Evidence from Neuroimaging Studies. (arXiv:2205.13421v1 [cs.LG])
    Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide experimental data which support that, when properly trained, machine learning models can generalize well across diverse conditions and do not suffer from biases. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that the accuracy of well-trained models is consistent across subgroups defined by attributes such as gender, age, and race, as well as across different clinical studies. We find that models that incorporate multi-source data from demographic, clinical, genetic factors and cognitive scores are also unbiased. These models have better predictive accuracy across subgroups than those trained only with structural measures in some cases, but there are also situations in which these additional features do not help.
    A Fair Federated Learning Framework With Reinforcement Learning. (arXiv:2205.13415v1 [cs.LG])
    Federated learning (FL) is a paradigm where many clients collaboratively train a model under the coordination of a central server, while keeping the training data locally stored. However, heterogeneous data distributions over different clients remain a challenge to mainstream FL algorithms, which may cause slow convergence, overall performance degradation and unfairness of performance across clients. To address these problems, in this study we propose a reinforcement learning framework, called PG-FFL, which automatically learns a policy to assign aggregation weights to clients. Additionally, we propose to utilize the Gini coefficient as the measure of fairness for FL. More importantly, we apply the Gini coefficient and validation accuracy of clients in each communication round to construct a reward function for the reinforcement learning. Our PG-FFL is also compatible with many existing FL algorithms. We conduct extensive experiments over diverse datasets to verify the effectiveness of our framework. The experimental results show that our framework can outperform baseline methods in terms of overall performance, fairness and convergence speed.
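    A small sketch of the reward construction described above; the exact weighting between mean accuracy and the Gini coefficient is an assumption, not the paper's stated formula:

        import numpy as np

        def gini(x):
            """Gini coefficient of non-negative values; 0 means perfectly fair."""
            x = np.sort(np.asarray(x, dtype=float))
            n = len(x)
            cum = np.cumsum(x)
            return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

        def reward(client_accs, lam=1.0):
            # higher mean accuracy and lower inequality across clients -> higher reward
            return float(np.mean(client_accs) - lam * gini(client_accs))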
    Avoiding Barren Plateaus with Classical Deep Neural Networks. (arXiv:2205.13418v1 [quant-ph])
    Variational quantum algorithms (VQAs) are among the most promising algorithms in the era of Noisy Intermediate Scale Quantum Devices. VQAs are applied to a variety of tasks, such as chemistry simulations, optimization problems, and quantum neural networks. Such algorithms are constructed using a parameterization U($\pmb{\theta}$) with a classical optimizer that updates the parameters $\pmb{\theta}$ in order to minimize a cost function $C$. For this task, in general the gradient descent method, or one of its variants, is used; the circuit parameters are updated iteratively using the cost function gradient. However, several works in the literature have shown that this method suffers from a phenomenon known as Barren Plateaus (BP). This phenomenon is characterized by an exponential flattening of the cost function landscape, so that the number of times the function must be evaluated to perform the optimization grows exponentially as the number of qubits and the parameterization depth increase. In this article, we report how using a classical neural network for the VQA's input parameters can alleviate the BP phenomenon.
    TransBoost: Improving the Best ImageNet Performance using Deep Transduction. (arXiv:2205.13331v1 [cs.CV])
    This paper deals with deep transductive learning, and proposes TransBoost as a procedure for fine-tuning any deep neural model to improve its performance on any (unlabeled) test set provided at training time. TransBoost is inspired by a large margin principle and is efficient and simple to use. The ImageNet classification performance is consistently and significantly improved with TransBoost on many architectures such as ResNets, MobileNetV3-L, EfficientNetB0, ViT-S, and ConvNext-T. Additionally, we show that TransBoost is effective on a wide variety of image classification datasets.
    BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning. (arXiv:2205.13383v1 [cs.CV])
    Deep neural networks are vulnerable to Trojan attacks. Existing attacks use visible patterns (e.g., a patch or image transformations) as triggers, which are vulnerable to human inspection. In this paper, we propose BppAttack, a stealthy and efficient Trojan attack. Based on existing biology literature on human visual systems, we propose to use image quantization and dithering as the Trojan trigger, making imperceptible changes. It is a stealthy and efficient attack that does not require training auxiliary models. Due to the small changes made to images, it is hard to inject such triggers during training. To alleviate this problem, we propose a contrastive learning based approach that leverages adversarial attacks to generate negative sample pairs so that the learned trigger is precise and accurate. The proposed method achieves high attack success rates on four benchmark datasets, including MNIST, CIFAR-10, GTSRB, and CelebA. It also effectively bypasses existing Trojan defenses and human inspection. Our code can be found at https://github.com/RU-System-Software-and-Security/BppAttack.
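    A sketch of the trigger family named above: colour-depth reduction combined with Floyd-Steinberg error-diffusion dithering (the bit-depth parameter is illustrative; this is not the authors' exact pipeline):

        import numpy as np

        def quantize_dither(img, d=5):
            """img: float32 array in [0, 1], shape (H, W, C)."""
            img = img.copy()
            levels = 2 ** d - 1
            H, W, _ = img.shape
            for y in range(H):
                for x in range(W):
                    old = img[y, x].copy()
                    new = np.round(old * levels) / levels    # quantize to 2^d levels
                    img[y, x] = new
                    err = old - new                          # diffuse the rounding error
                    if x + 1 < W: img[y, x + 1] += err * 7 / 16
                    if y + 1 < H:
                        if x > 0: img[y + 1, x - 1] += err * 3 / 16
                        img[y + 1, x] += err * 5 / 16
                        if x + 1 < W: img[y + 1, x + 1] += err * 1 / 16
            return np.clip(img, 0.0, 1.0)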
    Opinion Spam Detection: A New Approach Using Machine Learning and Network-Based Algorithms. (arXiv:2205.13422v1 [cs.LG])
    E-commerce is the fastest-growing segment of the economy. Online reviews play a crucial role in helping consumers evaluate and compare products and services. As a result, fake reviews (opinion spam) are becoming more prevalent and negatively impacting customers and service providers. There are many reasons why it is hard to identify opinion spammers automatically, including the absence of reliable labeled data. This limitation precludes an off-the-shelf application of a machine learning pipeline. We propose a new method for classifying reviewers as spammers or benign, combining machine learning with a message-passing algorithm that capitalizes on the users' graph structure to compensate for the possible scarcity of labeled data. We devise a new way of sampling the labels for the training step (active learning), replacing the typical uniform sampling. Experiments on three large real-world datasets from Yelp.com show that our method outperforms state-of-the-art active learning approaches and also machine learning methods that use a much larger set of labeled data for training.
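    As an illustrative stand-in for the message-passing component (not the paper's algorithm), scikit-learn's graph-based LabelSpreading propagates a handful of spammer/benign labels over a reviewer similarity graph; the features and label counts below are placeholders:

        import numpy as np
        from sklearn.semi_supervised import LabelSpreading

        X = np.random.rand(100, 8)          # per-reviewer behavioural features
        y = -1 * np.ones(100, dtype=int)    # -1 marks unlabeled reviewers
        y[:5] = 1                           # a few known spammers
        y[5:10] = 0                         # a few known benign users

        model = LabelSpreading(kernel="knn", n_neighbors=7)
        model.fit(X, y)                     # propagate labels over the kNN graph
        spam_scores = model.label_distributions_[:, 1]   # propagated spammer probability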
    Looking for Out-of-Distribution Environments in Critical Care: A case study with the eICU Database. (arXiv:2205.13398v1 [cs.LG])
    Generalizing to new populations and domains in machine learning is still an open problem which has seen increased interest recently. In particular, clinical models show a significant performance drop when tested in settings not seen during training, e.g., new hospitals or population demographics. Recently proposed models for domain generalisation promise to alleviate this problem by learning invariant characteristics across environments; however, there is still scepticism about whether they improve over traditional training. In this work, we take a principled approach to identifying Out of Distribution (OoD) environments, motivated by the problem of cross-hospital generalization in critical care. We propose model-based and heuristic approaches to identify OoD environments and systematically compare models with different levels of held-out information. In particular, based on the assumption that models with access to OoD data should outperform other models, we train models across a range of experimental setups that include leave-one-hospital-out training and cross-sectional feature splits. We find that access to OoD data does not translate to increased performance, pointing to inherent limitations in defining potential OoD environments in the eICU Database, potentially due to data harmonisation and sampling. Echoing similar results for other popular clinical benchmarks in the literature, our findings suggest that new approaches are required to evaluate robust models in critical care.
    Transfer and Share: Semi-Supervised Learning from Long-Tailed Data. (arXiv:2205.13358v1 [cs.LG])
    Long-Tailed Semi-Supervised Learning (LTSSL) aims to learn from class-imbalanced data where only a few samples are annotated. Existing solutions typically require substantial cost to solve complex optimization problems, or class-balanced undersampling which can result in information loss. In this paper, we present TRAS (TRAnsfer and Share) to effectively utilize long-tailed semi-supervised data. TRAS transforms the imbalanced pseudo-label distribution of a traditional SSL model via a delicate function to enhance the supervisory signals for minority classes. It then transfers the distribution to a target model such that the minority class will receive significant attention. Interestingly, TRAS shows that a more balanced pseudo-label distribution can substantially benefit minority-class training, instead of seeking to generate accurate pseudo-labels as in previous works. To simplify the approach, TRAS merges the training of the traditional SSL model and the target model into a single procedure by sharing the feature extractor, where both classifiers help improve the representation learning. According to extensive experiments, TRAS delivers much higher accuracy than state-of-the-art methods in the entire set of classes as well as minority classes.
    Feature Forgetting in Continual Representation Learning. (arXiv:2205.13359v1 [cs.LG])
    In continual and lifelong learning, good representation learning can help increase performance and reduce sample complexity when learning new tasks. There is evidence that representations do not suffer from "catastrophic forgetting" even in plain continual learning, but little else is known about their characteristics. In this paper, we aim to gain a better understanding of representation learning in continual learning, especially of the feature forgetting problem. We devise a protocol for evaluating representations in continual learning, and then use it to present an overview of the basic trends of continual representation learning, showing its consistent deficiency and potential issues. To study the feature forgetting problem, we create a synthetic dataset to identify and visualize the prevalence of feature forgetting in neural networks. Finally, we propose a simple technique using gating adapters to mitigate feature forgetting. We conclude by discussing how improving representation learning benefits both old and new tasks in continual learning.
    How Powerful are K-hop Message Passing Graph Neural Networks. (arXiv:2205.13328v1 [cs.LG])
    The most popular design paradigm for Graph Neural Networks (GNNs) is 1-hop message passing -- aggregating features from 1-hop neighbors repeatedly. However, the expressive power of 1-hop message passing is bounded by the Weisfeiler-Lehman (1-WL) test. Recently, researchers extended 1-hop message passing to K-hop message passing by aggregating information from K-hop neighbors of nodes simultaneously. However, there is no work on analyzing the expressive power of K-hop message passing. In this work, we theoretically characterize the expressive power of K-hop message passing. Specifically, we first formally differentiate two kinds of kernels of K-hop message passing which are often misused in previous works. We then characterize the expressive power of K-hop message passing by showing that it is more powerful than 1-hop message passing. Despite the higher expressive power, we show that K-hop message passing still cannot distinguish some simple regular graphs. To further enhance its expressive power, we introduce a KP-GNN framework, which improves K-hop message passing by leveraging the peripheral subgraph information in each hop. We prove that KP-GNN can distinguish almost all regular graphs including some distance regular graphs which could not be distinguished by previous distance encoding methods. Experimental results verify the expressive power and effectiveness of KP-GNN. KP-GNN achieves competitive results across all benchmark datasets.
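    A minimal sketch of the K-hop aggregation idea under the shortest-path-distance kernel (one of the two kernels the paper distinguishes): features of nodes at exactly k hops are aggregated separately for each k, rather than only 1-hop neighbours:

        import numpy as np

        def k_hop_features(A, X, K=3):
            """A: (n, n) adjacency matrix, X: (n, d) node features."""
            n = A.shape[0]
            reach = np.eye(n, dtype=bool)        # nodes within fewer than k hops
            frontier = np.eye(n, dtype=bool)     # nodes at exactly k-1 hops
            hops = []
            for _ in range(K):
                nxt = (frontier.astype(float) @ A) > 0
                frontier = nxt & ~reach          # nodes at exactly k hops
                reach |= frontier
                hops.append(frontier.astype(float) @ X)   # sum-aggregate features per hop
            return np.concatenate(hops, axis=1)  # (n, K * d), one block per hop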
    Deep Active Learning with Noise Stability. (arXiv:2205.13340v1 [cs.LG])
    Uncertainty estimation for unlabeled data is crucial to active learning. With a deep neural network employed as the backbone model, the data selection process is highly challenging due to the potential over-confidence of the model inference. Existing methods resort to special learning fashions (e.g. adversarial) or auxiliary models to address this challenge. This tends to result in complex and inefficient pipelines, which would render the methods impractical. In this work, we propose a novel algorithm that leverages noise stability to estimate data uncertainty in a Single-Training Multi-Inference fashion. The key idea is to measure the output deviation from the original observation when the model parameters are randomly perturbed by noise. We provide theoretical analyses by leveraging the small Gaussian noise theory and demonstrate that our method favors a subset with large and diverse gradients. Despite its simplicity, our method outperforms the state-of-the-art active learning baselines in various tasks, including computer vision, natural language processing, and structural data analysis.
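    A hedged sketch of the Single-Training Multi-Inference scoring: perturb a trained model's weights with small Gaussian noise several times and rank unlabeled points by their output deviation (the noise scale and aggregation are assumptions, not the paper's exact recipe):

        import copy
        import torch

        @torch.no_grad()
        def noise_stability_scores(model, x, sigma=0.01, n_perturb=10):
            base = model(x)                               # outputs of the unperturbed model
            score = torch.zeros(x.shape[0])
            for _ in range(n_perturb):
                noisy = copy.deepcopy(model)
                for p in noisy.parameters():
                    p.add_(sigma * torch.randn_like(p))   # small Gaussian weight noise
                score += (noisy(x) - base).norm(dim=1)    # per-example output deviation
            return score / n_perturb                      # high score = uncertain -> query label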
    QUIC-FL: Quick Unbiased Compression for Federated Learning. (arXiv:2205.13341v1 [cs.LG])
    Distributed Mean Estimation (DME) is a fundamental building block in communication efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State-of-the-art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decodes each gradient individually, which markedly slows the aggregation time. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation time, and is competitive with the most accurate (slow aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.
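    For reference, the basic unbiased building block such schemes start from is stochastic (randomized) rounding, which satisfies E[decode(encode(x))] = x; QUIC-FL's actual near-optimal scheme is more elaborate than this sketch:

        import numpy as np

        def stochastic_quantize(x, bits=2):
            lo, hi = x.min(), x.max()
            levels = 2 ** bits - 1
            scaled = (x - lo) / (hi - lo + 1e-12) * levels
            floor = np.floor(scaled)
            # round up with probability equal to the fractional part -> unbiased
            q = floor + (np.random.rand(*x.shape) < (scaled - floor))
            return q.astype(np.uint8), lo, hi

        def dequantize(q, lo, hi, bits=2):
            return lo + q.astype(float) / (2 ** bits - 1) * (hi - lo)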
    Acute Lymphoblastic Leukemia Detection Using Hypercomplex-Valued Convolutional Neural Networks. (arXiv:2205.13273v1 [cs.CV])
    This paper features convolutional neural networks defined on hypercomplex algebras applied to classify lymphocytes in blood smear digital microscopic images. Such classification is helpful for the diagnosis of acute lymphoblastic leukemia (ALL), a type of blood cancer. We perform the classification task using eight hypercomplex-valued convolutional neural networks (HvCNNs) along with real-valued convolutional networks. Our results show that HvCNNs perform better than the real-valued model, showcasing higher accuracy with a much smaller number of parameters. Moreover, we found that HvCNNs based on Clifford algebras processing HSV-encoded images attained the highest observed accuracies. Precisely, our HvCNN yielded an average accuracy rate of 96.6% using the ALL-IDB2 dataset with a 50% train-test split, a value extremely close to the state-of-the-art models but using a much simpler architecture with significantly fewer parameters.
    On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition. (arXiv:2205.13282v1 [cs.CV])
    Fine-Grained Visual Categorization (FGVC) is challenging because the subtle inter-class variations are difficult to capture. One notable research line uses the Global Covariance Pooling (GCP) layer to learn powerful representations with second-order statistics, which can effectively model inter-class differences. In our previous conference paper, we showed that truncating the small eigenvalues of the GCP covariance attains smoother gradients and improves the performance on large-scale benchmarks. However, on fine-grained datasets, truncating the small eigenvalues makes the model fail to converge. This observation contradicts the common assumption that the small eigenvalues merely correspond to noisy and unimportant information, such that ignoring them should have little influence on the performance. To diagnose this peculiar behavior, we propose two attribution methods whose visualizations demonstrate that the seemingly unimportant small eigenvalues are crucial as they are in charge of extracting the discriminative class-specific features. Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues. Without introducing any additional parameters, this branch simply amplifies the small eigenvalues and achieves state-of-the-art performance among GCP methods on three fine-grained benchmarks. Furthermore, the performance is also competitive against other FGVC approaches on larger datasets. Code is available at https://github.com/KingJamesSong/DifferentiableSVD.
    Fair Representation Learning through Implicit Path Alignment. (arXiv:2205.13316v1 [cs.LG])
    We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant with respect to different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer loop, and invariant optimal group predictors are updated in the inner loop. Moreover, the proposed bi-level objective is demonstrated to fulfill the sufficiency rule, which is desirable in various practical scenarios but has not been commonly studied in fair learning. In addition, to avoid the high computational and memory cost of differentiating through the inner loop of the bi-level objective, we propose an implicit path alignment algorithm, which relies only on the solution of the inner optimization and implicit differentiation rather than the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measures.
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v1 [cs.LG])
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.
    Investigating classification learning curves for automatically generated and labelled plant images. (arXiv:2205.10955v2 [cs.LG] UPDATED)
    In the context of supervised machine learning, a learning curve describes how a model's performance on unseen data relates to the number of samples used to train the model. In this paper we present a dataset of plant images with representatives of crops and weeds common to the Manitoba prairies at different growth stages. We determine the learning curve for a classification task on this data with the ResNet architecture. Our results are in accordance with previous studies and add to the evidence that learning curves are governed by power-law relationships over large scales, applications, and models. We further investigate how label noise and the reduction of trainable parameters impact the learning curve on this dataset. Both effects lead to the model requiring disproportionally larger training sets to achieve the same classification performance as observed without these effects.
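    A minimal sketch of the power-law analysis mentioned above: fitting error(n) ≈ a·n^(-b) reduces to a linear fit in log-log space (the synthetic curve below is purely illustrative):

        import numpy as np

        def fit_power_law(n_samples, errors):
            slope, log_a = np.polyfit(np.log(n_samples), np.log(errors), 1)
            return np.exp(log_a), -slope               # error(n) ~ a * n**(-b)

        n = np.array([100, 300, 1000, 3000, 10000])
        err = 2.0 * n ** -0.35 * np.exp(0.05 * np.random.randn(5))  # synthetic learning curve
        a, b = fit_power_law(n, err)
        print(f"error(n) ~ {a:.2f} * n^(-{b:.2f})")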
    SARS-CoV-2 Result Interpretation based on Image Analysis of Lateral Flow Devices. (arXiv:2205.13311v1 [cs.LG])
    The widely used gene quantisation technique, the Lateral Flow Device (LFD), is now commonly used to detect the presence of SARS-CoV-2, enabling the control and prevention of the spread of the virus. Depending on the viral load, LFDs have different sensitivities, and self-testing by untrained users presents an additional challenge in interpreting the result. With the evolution of machine learning algorithms, image processing and analysis have seen unprecedented growth. In this interdisciplinary study, we employ novel image analysis methods from the computer vision and machine learning fields to study visual features of the control region of LFDs. Here, we automatically classify any image containing an LFD as positive, negative, or inconclusive. This will reduce the burden on health workers and mitigate perception bias.
    Triangular Contrastive Learning on Molecular Graphs. (arXiv:2205.13279v1 [cs.LG])
    Recent contrastive learning methods have proven effective in various tasks, learning generalizable representations that are invariant to data augmentation and thereby achieving state-of-the-art performance. Given the multifaceted nature of the large unlabeled data used in self-supervised learning, while the majority of real-world downstream tasks use a single format of data, an important challenge is a multimodal framework that trains a single modality to learn diverse perspectives from the other modalities. In this paper, we propose TriCL (Triangular Contrastive Learning), a universal framework for trimodal contrastive learning. TriCL takes advantage of the Triangular Area Loss, a novel intermodal contrastive loss that learns the angular geometry of the embedding space by simultaneously contrasting the areas of positive and negative triplets. Systematic observation of the embedding space in terms of alignment and uniformity showed that the Triangular Area Loss can address the line-collapsing problem by discriminating modalities by angle. Our experimental results also demonstrate that TriCL outperforms baselines on the downstream task of molecular property prediction, implying that the advantages of the embedding space indeed benefit performance on downstream tasks.
    Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension. (arXiv:2205.13303v1 [stat.ML])
    While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high-dimensions have the same minimum training loss as Gaussian data with corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.
    Privacy-Preserving Wavelet Neural Network with Fully Homomorphic Encryption. (arXiv:2205.13265v1 [cs.LG])
    The main aim of Privacy-Preserving Machine Learning (PPML) is to protect the privacy and provide security to the data used in building Machine Learning models. There are various techniques in PPML such as Secure Multi-Party Computation, Differential Privacy, and Homomorphic Encryption (HE). The techniques are combined with various Machine Learning models and even Deep Learning Networks to protect the data privacy as well as the identity of the user. In this paper, we propose a fully homomorphic encrypted wavelet neural network to protect privacy and at the same time not compromise on the efficiency of the model. We tested the effectiveness of the proposed method on seven datasets taken from the finance and healthcare domains. The results show that our proposed model performs similarly to the unencrypted model.
    SymNMF-Net for The Symmetric NMF Problem. (arXiv:2205.13214v1 [cs.LG])
    Recently, many works have demonstrated that Symmetric Non-negative Matrix Factorization (SymNMF) enjoys a great superiority for various clustering tasks. Although the state-of-the-art algorithms for SymNMF perform well on synthetic data, they cannot consistently obtain satisfactory results with desirable properties and may fail on real-world tasks like clustering. Considering the flexibility and strong representation ability of the neural network, in this paper, we propose a neural network called SymNMF-Net for the Symmetric NMF problem to overcome the shortcomings of traditional optimization algorithms. Each block of SymNMF-Net is a differentiable architecture with an inversion layer, a linear layer and ReLU, which are inspired by a traditional update scheme for SymNMF. We show that the inference of each block corresponds to a single iteration of the optimization. Furthermore, we analyze the constraints of the inversion layer to ensure the output stability of the network to a certain extent. Empirical results on real-world datasets demonstrate the superiority of our SymNMF-Net and confirm the sufficiency of our theoretical analysis.
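    For reference, a sketch of the kind of traditional update scheme that inspires each SymNMF-Net block: a damped multiplicative rule for minimizing ||A - HH^T||_F^2 (the damping constant beta is an assumption, not the paper's exact scheme):

        import numpy as np

        def symnmf(A, k, iters=200, beta=0.5, eps=1e-10):
            """Approximate A (symmetric, non-negative) by H @ H.T with H >= 0."""
            n = A.shape[0]
            H = np.abs(np.random.rand(n, k))
            for _ in range(iters):
                AH = A @ H
                HHtH = H @ (H.T @ H)
                H = H * (1 - beta + beta * AH / (HHtH + eps))  # damped multiplicative step
            return H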
    Penalizing Proposals using Classifiers for Semi-Supervised Object Detection. (arXiv:2205.13219v1 [cs.CV])
    Obtaining gold-standard annotated data for object detection is often costly, involving human-level effort. Semi-supervised object detection algorithms solve the problem with a small amount of gold-standard labels and a large unlabelled dataset used to generate silver-standard labels. But training on the silver-standard labels does not produce good results, because they are machine-generated annotations. In this work, we design a modified loss function to train on large silver-standard annotated sets generated by a weak annotator. We include a confidence metric associated with the annotation as an additional term in the loss function, signifying the quality of the annotation. We test the effectiveness of our approach on various test sets and use numerous variations to compare the results with some of the current approaches to object detection. In comparison with the baseline where no confidence metric is used, we achieved a 4% gain in mAP with 25% labeled data and a 10% gain in mAP with 50% labeled data by using the proposed confidence metric.
    Evaluating Multimodal Interactive Agents. (arXiv:2205.13274v1 [cs.LG])
    Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. https://youtu.be/YR1TngGORGQ
    Federated Split BERT for Heterogeneous Text Classification. (arXiv:2205.13299v1 [cs.CL])
    Pre-trained BERT models have achieved impressive performance in many natural language processing (NLP) tasks. However, in many real-world situations, textual data are usually decentralized over many clients and unable to be uploaded to a central server due to privacy protection and regulations. Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping the local data private. A few studies have investigated BERT in the federated learning setting, but the problem of performance loss caused by heterogeneous (e.g., non-IID) data across clients remains under-explored. To address this issue, we propose a framework, FedSplitBERT, which handles heterogeneous data and decreases the communication cost by splitting the BERT encoder layers into a local part and a global part. The local part parameters are trained by the local client only, while the global part parameters are trained by aggregating the gradients of multiple clients. Due to the sheer size of BERT, we explore a quantization method to further reduce the communication cost with minimal performance loss. Our framework is ready-to-use and compatible with many existing federated learning algorithms, including FedAvg, FedProx and FedAdam. Our experiments verify the effectiveness of the proposed framework, which outperforms baseline methods by a significant margin, while FedSplitBERT with quantization can reduce the communication cost by $11.9\times$.
    The Effect of Task Ordering in Continual Learning. (arXiv:2205.13323v1 [cs.LG])
    We investigate the effect of task ordering on continual learning performance. We conduct an extensive series of empirical experiments on synthetic and naturalistic datasets and show that reordering tasks significantly affects the amount of catastrophic forgetting. Connecting to the field of curriculum learning, we show that the effect of task ordering can be exploited to modify continual learning performance, and present a simple approach for doing so. Our method computes the distance between all pairs of tasks, where distance is defined as the source task curvature of a gradient step toward the target task. Using statistically rigorous methods and sound experimental design, we show that task ordering is an important aspect of continual learning that can be modified for improved performance.
    DeepTechnome: Mitigating Unknown Bias in Deep Learning Based Assessment of CT Images. (arXiv:2205.13297v1 [eess.IV])
    Reliably detecting diseases using relevant biological information is crucial for real-world applicability of deep learning techniques in medical imaging. We debias deep learning models during training against unknown bias - without preprocessing/filtering the input beforehand or assuming specific knowledge about its distribution or precise nature in the dataset. We use control regions as surrogates that carry information regarding the bias, employ the classifier model to extract features, and suppress biased intermediate features with our custom, modular DecorreLayer. We evaluate our method on a dataset of 952 lung computed tomography scans by introducing simulated biases w.r.t. reconstruction kernel and noise level and propose including an adversarial test set in evaluations of bias reduction techniques. When the proposed method is applied to a moderately sized model architecture trained on data exhibiting a strong bias, it near-perfectly recovers the classification performance observed when training with the corresponding unbiased data.
    DT-SV: A Transformer-based Time-domain Approach for Speaker Verification. (arXiv:2205.13249v1 [cs.SD])
    Speaker verification (SV) aims to determine whether the speaker identity of a test utterance is the same as that of a reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone mainstream. Recently, different attention mechanisms and Transformer networks have been explored widely in SV fields. However, utilizing the original Transformer in SV directly may waste frame-level information in the output features, which could restrict the capacity and discrimination of speaker embeddings. Therefore, we propose an approach to derive utterance-level speaker embeddings via a Transformer architecture that uses a novel loss function, named diffluence loss, to integrate the feature information of different Transformer layers. The diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it can be integrated into the Transformer expediently. In addition, we introduce a learnable mel-fbank energy feature extractor, named the time-domain feature extractor, that computes the mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining the diffluence loss and the time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that our proposed model achieves better performance in comparison with other models.
    Collaborative Distillation Meta Learning for Simulation Intensive Hardware Design. (arXiv:2205.13225v1 [cs.LG])
    This paper proposes a novel collaborative distillation meta learning (CDML) framework for simulation-intensive hardware design problems. Deep reinforcement learning (DRL) has shown promising performance in various hardware design problems. However, previous works on DRL-based hardware design only dealt with problems with simplified objectives, which are not practical. In fact, evaluating real-world electrical performance through simulation is costly in terms of both time and computation, making DRL schemes that involve extensive reward calculations unsuitable. In this paper, we apply the CDML framework to the decoupling capacitor placement problem (DPP), one of the most significant simulation-intensive hardware design problems. The CDML framework consists of a context-based meta learner and a collaborative distillation scheme to produce a reusable solver. The context-based meta learner captures the location of the probing port (i.e., the target circuit block) and improves generalization capability. The collaborative distillation scheme with equivariant label transformation imposes the action-permutation (AP)-equivariant nature of placement problems, which improves both sample efficiency and generalization capability. Extensive experimental results verify that our CDML outperforms both neural baselines and iterative conventional design methods in terms of the real-world objective, power integrity, with zero-shot transferability.
    A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. (arXiv:2205.13218v1 [cs.LG])
    Real-world applications require the classification model to adapt to new classes without forgetting old ones. Correspondingly, Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. Typical CIL methods tend to save representative exemplars from former classes to resist forgetting, while recent works find that storing models from history can substantially boost the performance. However, the stored models are not counted into the memory budget, which implicitly results in unfair comparisons. We find that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work, especially for the case with limited memory budgets. As a result, we need to holistically evaluate different CIL methods at different memory scales and simultaneously consider accuracy and memory size for measurement. On the other hand, we dive deeply into the construction of the memory buffer for memory efficiency. By analyzing the effect of different layers in the network, we find that shallow and deep layers have different characteristics in CIL. Motivated by this, we propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel. MEMO extends specialized layers based on the shared generalized representations, efficiently extracting diverse representations with modest cost and maintaining representative exemplars. Extensive experiments on benchmark datasets validate MEMO's competitive performance.
    Continual Feature Selection: Spurious Features in Continual Learning. (arXiv:2203.01012v2 [cs.LG] UPDATED)
    Continual Learning (CL) is the research field addressing learning without forgetting when the data distribution is not static. This paper studies the influence of spurious features on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features and (2) local spurious features. The first is due to a covariate shift between training and testing data, while the second is due to the limited access to data at each training step. We study (1) through a consistent set of continual learning experiments varying the amount of spurious correlation and the data distribution support. We show that (2) is a major cause of performance decrease in continual learning, along with catastrophic forgetting. This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features on algorithms' capabilities.
    QSpeech: Low-Qubit Quantum Speech Application Toolkit. (arXiv:2205.13221v1 [quant-ph])
    Quantum devices with low qubit counts are common in the Noisy Intermediate-Scale Quantum (NISQ) era. However, running a Quantum Neural Network (QNN) on such devices is difficult, since a QNN is based on a Variational Quantum Circuit (VQC), which requires many qubits. It is therefore critical to make QNNs with VQCs run on low-qubit quantum devices. In this study, we propose a novel VQC called the low-qubit VQC. A standard VQC requires a number of qubits determined by the input dimension; the low-qubit VQC relaxes this requirement through a linear transformation of the input, allowing QNNs to run on low-qubit quantum devices for speech applications. Furthermore, compared to the standard VQC, our proposed low-qubit VQC stabilizes the training process. Based on the low-qubit VQC, we implement QSpeech, a library for quick prototyping of hybrid quantum-classical neural networks in the speech field. It provides numerous quantum neural layers and QNN models for speech applications. Experiments on Speech Command Recognition and Text-to-Speech show that our proposed low-qubit VQC outperforms the standard VQC and is more stable.
    Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. (arXiv:2205.11508v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, its theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the umbrella of spectral manifold learning to address those limitations. Through the course of this study, we rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification allows us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and, most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with a small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
    Mask-based Latent Reconstruction for Reinforcement Learning. (arXiv:2201.12096v2 [cs.LG] UPDATED)
    For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional inputs prevent effective representation learning. To address this, motivated by the success of masked modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict complete state representations in the latent space from observations with spatially and temporally masked pixels. MLR enables better use of context information when learning state representations, making them more informative and facilitating RL agent training. Extensive experiments show that MLR significantly improves sample efficiency in RL and outperforms state-of-the-art sample-efficient RL methods on multiple continuous and discrete control benchmarks. The code will be released soon.
    The Shapley Value in Machine Learning. (arXiv:2202.05594v2 [cs.LG] UPDATED)
    Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.
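    For reference, the Shapley value of player $i$ in a game with characteristic function $v$ over player set $N$ ($|N|=n$) is $\phi_i(v)=\sum_{S\subseteq N\setminus\{i\}}\frac{|S|!\,(n-|S|-1)!}{n!}\left(v(S\cup\{i\})-v(S)\right)$. A brute-force sketch for small games follows; the exponential coalition enumeration is exactly why the ML applications surveyed above rely on sampling-based approximations.

    ```python
    from itertools import combinations
    from math import factorial

    def shapley_values(players, v):
        """Exact Shapley values for v: frozenset -> float (small games only)."""
        n = len(players)
        phi = {}
        for i in players:
            others = [p for p in players if p != i]
            total = 0.0
            for k in range(n):
                for S in combinations(others, k):
                    S = frozenset(S)
                    w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                    total += w * (v(S | {i}) - v(S))
            phi[i] = total
        return phi

    # Three-player glove game: player 1 holds a left glove, 2 and 3 right gloves.
    v = lambda S: 1.0 if (1 in S and ({2, 3} & set(S))) else 0.0
    print(shapley_values([1, 2, 3], v))  # {1: 0.667, 2: 0.167, 3: 0.167}
    ```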
    Denial-of-Service Attacks on Learned Image Compression. (arXiv:2205.13253v1 [cs.CV])
    Deep learning techniques have shown promising results in image compression, with competitive bitrate and image reconstruction quality from the compressed latent. However, while image compression has progressed towards a higher peak signal-to-noise ratio (PSNR) and fewer bits per pixel (bpp), the robustness of these systems to corner-case images has received little attention. In this work, we, for the first time, investigate the robustness of image compression systems, where an imperceptible perturbation of the input image can precipitate a significant increase in the bitrate of its compressed latent. To characterize the robustness of state-of-the-art learned image compression, we mount white-box and black-box attacks. Our results on several image compression models with various bitrate qualities show that they are surprisingly fragile: the white-box attack achieves up to a 56.326x bpp change and the black-box attack up to 1.947x. To improve robustness, we propose a novel model which incorporates attention modules and a basic factorized entropy model, resulting in a promising trade-off between the PSNR/bpp ratio and robustness to adversarial attacks that surpasses existing learned image compressors.
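    The white-box attack amounts to gradient ascent on the rate term under an imperceptibility budget. A hedged PGD-style sketch follows; the dictionary interface model(x)["bpp"], returning a differentiable bits-per-pixel estimate from the entropy model, is an assumed stand-in, not the interface of any particular codec.

    ```python
    import torch

    def bitrate_attack(model, x, eps=2/255, alpha=0.5/255, steps=50):
        """L_inf-bounded perturbation that maximizes estimated bits-per-pixel."""
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(steps):
            bpp = model((x + delta).clamp(0, 1))["bpp"]   # assumed interface
            bpp.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()        # ascend on the rate
                delta.clamp_(-eps, eps)
                delta.grad.zero_()
        return (x + delta).clamp(0, 1).detach()
    ```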
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v1 [cs.LG])
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which generalizes active learning based on partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error to the number of samples. We illustrate our technique in depth for robust regression.
    Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain. (arXiv:2112.04684v3 [cs.RO] UPDATED)
    We present a reward-predictive, model-based deep learning method featuring trajectory-constrained visual attention for local planning in visual navigation tasks. Our method learns to place visual attention at locations in latent image space which follow trajectories caused by vehicle control actions to enhance predictive accuracy during planning. The attention model is jointly optimized by the task-specific loss and an additional trajectory-constraint loss, allowing adaptability yet encouraging a regularized structure for improved generalization and reliability. Importantly, visual attention is applied in latent feature map space instead of raw image space to promote efficient planning. We validated our model in visual navigation tasks of planning low-turbulence, collision-free trajectories in off-road settings and hill climbing with locking differentials in the presence of slippery terrain. Experiments involved randomized, procedurally generated simulations and real-world environments. We found our method improved generalization and learning efficiency when compared to no-attention and self-attention alternatives.
    Orthogonal Stochastic Configuration Networks with Adaptive Construction Parameter for Data Analytics. (arXiv:2205.13191v1 [cs.LG])
    As a randomized learner model, the Stochastic Configuration Network (SCN) is remarkable in that its random weights and biases are assigned under a supervisory mechanism that ensures universal approximation and fast learning. However, the randomness makes SCNs likely to generate approximately linearly correlated hidden nodes that are redundant and of low quality, resulting in a non-compact network structure. In light of the fundamental machine learning principle that a model with fewer parameters generalizes better, this paper proposes an orthogonal SCN, termed OSCN, which filters out low-quality hidden nodes to reduce the network structure by incorporating Gram-Schmidt orthogonalization. The universal approximation property of OSCN and an adaptive setting for the key construction parameters are presented in detail. In addition, an incremental updating scheme is developed to dynamically determine the output weights, contributing to improved computational efficiency. Finally, experimental results on two numerical examples and several real-world regression and classification datasets substantiate the effectiveness and feasibility of the proposed approach.
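    The node-filtering step rests on plain Gram-Schmidt orthogonalization. A sketch of how candidate hidden-node activation vectors can be orthogonalized, with near-linearly-dependent candidates rejected, follows; the tolerance and candidate generation are illustrative, not the paper's adaptive parameter settings.

    ```python
    import numpy as np

    def filter_hidden_nodes(H, tol=1e-3):
        """Modified Gram-Schmidt over columns of H (one column per candidate
        hidden node, evaluated on the training inputs). Columns whose residual
        after projection onto the kept basis is tiny are dropped as redundant."""
        kept, basis = [], []
        for j in range(H.shape[1]):
            h = H[:, j].astype(float).copy()
            for q in basis:
                h -= (q @ h) * q                # remove existing directions
            r = np.linalg.norm(h)
            if r > tol * np.linalg.norm(H[:, j]):
                basis.append(h / r)             # genuinely new direction
                kept.append(j)
        return kept

    # Usage: H = sigmoid(X @ W + b) for randomly configured (W, b);
    # retain only H[:, filter_hidden_nodes(H)] before solving output weights.
    ```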
    Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers. (arXiv:2107.03996v3 [cs.LG] UPDATED)
    We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot. Our project page with videos is at https://rchalyang.github.io/LocoTransformer/ .
    $O(N^2)$ Universal Antisymmetry in Fermionic Neural Networks. (arXiv:2205.13205v1 [cs.LG])
    Fermionic neural network (FermiNet) is a recently proposed wavefunction Ansatz, which is used in variational Monte Carlo (VMC) methods to solve the many-electron Schr\"odinger equation. FermiNet proposes permutation-equivariant architectures, on which a Slater determinant is applied to induce antisymmetry. FermiNet is proved to have universal approximation capability with a single determinant, namely, it suffices to represent any antisymmetric function given sufficient parameters. However, the asymptotic computational bottleneck comes from the Slater determinant, which scales with $O(N^3)$ for $N$ electrons. In this paper, we substitute the Slater determinant with a pairwise antisymmetry construction, which is easy to implement and can reduce the computational cost to $O(N^2)$. Furthermore, we formally prove that the pairwise construction built upon permutation-equivariant architectures can universally represent any antisymmetric function.
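    The pairwise construction can be illustrated concretely: if $g(a,b)=-g(b,a)$, then $\prod_{i<j} g(x_i,x_j)$ flips sign under any transposition of its arguments, and evaluating it touches only $O(N^2)$ pair terms instead of an $O(N^3)$ determinant. A toy numerical check, leaving out the actual permutation-equivariant FermiNet features:

    ```python
    import numpy as np

    def g(a, b):
        # Any antisymmetric pair function works: h(a, b) - h(b, a).
        h = lambda u, v: np.tanh(2.0 * u + v)
        return h(a, b) - h(b, a)

    def pairwise_antisym(x):
        """Product of g over all O(N^2) pairs i < j."""
        out = 1.0
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                out *= g(x[i], x[j])
        return out

    x = np.random.randn(5)
    y = x.copy()
    y[[0, 3]] = y[[3, 0]]                     # swap two "electrons"
    assert np.isclose(pairwise_antisym(y), -pairwise_antisym(x))
    ```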
    On Learning Mixture of Linear Regressions in the Non-Realizable Setting. (arXiv:2205.13166v1 [stat.ML])
    While mixture of linear regressions (MLR) is a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, {\em prediction} and {\em loss} are not well-defined in the context of mixtures. In this paper, we first show that MLR can be used for prediction where, instead of predicting a single label, the model predicts a list of values (also known as {\em list-decoding}). The list size is equal to the number of components in the mixture, and the loss function is defined as the minimum among the losses incurred by the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves a small probability of prediction error. This calls for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works in MLR focus on the {\em realizable} setting, i.e., recovery of parameters when data is probabilistically generated by a mixed linear (noisy) model. In this paper we show that a version of the popular alternating minimization (AM) algorithm finds the best-fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in time polynomial in the number of datapoints and recovers a good approximation of the best-fit lines. The two algorithms are compared experimentally.
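    In practice the AM iteration analyzed above has a very simple shape: assign each point to the line that currently fits it best (the list-decoding loss takes the minimum), then refit each line by least squares on its assigned points. A sketch, with random initialization standing in for the initial-point conditions the theory requires:

    ```python
    import numpy as np

    def alternating_minimization(X, y, k=2, iters=50, seed=0):
        """Find k best-fit lines for (X, y) without a realizability assumption.
        X is (n, d), y is (n,); returns the (k, d) weight matrix."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(k, X.shape[1]))
        for _ in range(iters):
            losses = (X @ W.T - y[:, None]) ** 2      # (n, k) per-line losses
            assign = losses.argmin(axis=1)            # min-loss assignment
            for j in range(k):
                idx = assign == j
                if idx.any():                         # refit line j
                    W[j] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        return W   # the model's "prediction" is the list {W[j] @ x}
    ```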
    AI for Porosity and Permeability Prediction from Geologic Core X-Ray Micro-Tomography. (arXiv:2205.13189v1 [cs.LG])
    Geologic cores are rock samples that are extracted from deep under the ground during the well-drilling process. They are used to characterize the performance of petroleum reservoirs. Traditionally, physical studies of cores are carried out by means of manual, time-consuming experiments. With the development of deep learning, scientists have actively started working on machine-learning-based approaches to identify physical properties without any manual experiments. Several previous works used machine learning to determine the porosity and permeability of rocks, but these methods were either inaccurate or computationally expensive. We propose to use self-supervised pretraining of a very small CNN-transformer-based model to predict the physical properties of rocks with high accuracy in a time-efficient manner. We show that this technique prevents overfitting even for extremely small datasets.
    RENs: Relevance Encoding Networks. (arXiv:2205.13061v1 [cs.LG])
    The manifold assumption for high-dimensional data assumes that the data is generated by varying a set of parameters obtained from a low-dimensional latent space. Deep generative models (DGMs) are widely used to learn data representations in an unsupervised way. DGMs parameterize the underlying low-dimensional manifold in the data space using bottleneck architectures such as variational autoencoders (VAEs). The bottleneck dimension for VAEs is treated as a hyperparameter that depends on the dataset and is fixed at design time after extensive tuning. As the intrinsic dimensionality of most real-world datasets is unknown, often, there is a mismatch between the intrinsic dimensionality and the latent dimensionality chosen as a hyperparameter. This mismatch can negatively contribute to the model performance for representation learning and sample generation tasks. This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality. The relevance of each latent dimension is directly learned from the data along with the other model parameters using stochastic gradient descent and a reparameterization trick adapted to non-Gaussian priors. We leverage the concept of DeepSets to capture permutation invariant statistical properties in both data and latent spaces for relevance determination. The proposed framework is general and flexible and can be used for the state-of-the-art VAE models that leverage regularizers to impose specific characteristics in the latent space (e.g., disentanglement). With extensive experimentation on synthetic and public image datasets, we show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
    Evolutionary scheduling of university activities based on consumption forecasts to minimise electricity costs. (arXiv:2202.12595v2 [cs.LG] UPDATED)
    This paper presents a solution to a predict-then-optimise problem whose goal is to reduce the electricity cost of a university campus. The proposed methodology combines a multi-dimensional time-series forecast and a novel approach to large-scale optimization. A gradient-boosting method is applied to forecast both the generation and consumption time series of the Monash University campus for the month of November 2020. For the consumption forecasts we employ a log transformation to model the trend and stabilize the variance. Additional seasonality and trend features are added to the model inputs where applicable. The forecasts obtained are used as the base load for the schedule optimisation of university activities and battery usage. The goal of the optimisation is to minimise the electricity cost, consisting of the price of electricity and the peak electricity tariff, both altered by the load from class activities and battery use, as well as the penalty for not scheduling some optional activities. The schedule of the class activities is obtained through evolutionary optimisation using the covariance matrix adaptation evolution strategy and the genetic algorithm. This schedule is then improved through local search by testing possible times for each activity one by one. The battery schedule is formulated as a mixed-integer programming problem and solved by the Gurobi solver. This method obtains the second-lowest cost when evaluated against six other methods presented at an IEEE competition that all used mixed-integer programming and the Gurobi solver to schedule both the activities and the battery use. The code and data used for the paper are publicly available.
    Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. (arXiv:2205.13134v1 [cs.AI])
    Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, favoring the one that maps to the arithmetic expression of the underlying physics. Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated by numerical examples, in comparison with state-of-the-art baselines.
    On the Evolution of A.I. and Machine Learning: Towards Measuring and Understanding Impact, Influence, and Leadership at Premier A.I. Conferences. (arXiv:2205.13131v1 [cs.AI])
    Artificial Intelligence is now recognized as a general-purpose technology with ample impact on human life. In this work, we aim to understand the evolution of AI and Machine Learning by analyzing researchers' impact, influence, and leadership over the last decades. This work also intends to shed new light on the history and evolution of AI by exploring the dynamics of the field through the lens of the papers published at AI conferences since the first International Joint Conference on Artificial Intelligence (IJCAI) in 1969. AI development and evolution have led to increasing research output, reflected in the number of articles published over the last sixty years. We construct comprehensive citation-collaboration and paper-author datasets and compute corresponding centrality measures to carry out our analyses. These analyses allow a better understanding of how AI has reached its current state of affairs in research. Throughout the process, we correlate these datasets with the work of the ACM Turing Award winners and the two so-called AI winters the field has gone through. We also look at self-citation trends and new authors' behaviors. Finally, we present a novel way to infer the country of affiliation of a paper from its organization. This work thus provides a deep analysis of Artificial Intelligence history based on large datasets from premier technical venues and suggests novel insights that can contribute to understanding and measuring AI's evolution.
    Cost-efficient Gaussian Tensor Network Embeddings for Tensor-structured Inputs. (arXiv:2205.13163v1 [math.NA])
    This work discusses tensor network embeddings, which are random matrices ($S$) with tensor network structure. These embeddings have been used to perform dimensionality reduction of tensor network structured inputs $x$ and accelerate applications such as tensor decomposition and kernel regression. Existing works have designed embeddings for inputs $x$ with specific structures, such that the computational cost for calculating $Sx$ is efficient. We provide a systematic way to design tensor network embeddings consisting of Gaussian random tensors, such that for inputs with more general tensor network structures, both the sketch size (row size of $S$) and the sketching computational cost are low. We analyze general tensor network embeddings that can be reduced to a sequence of sketching matrices. We provide a sufficient condition to quantify the accuracy of such embeddings and derive sketching asymptotic cost lower bounds using embeddings that satisfy this condition and have a sketch size lower than any input dimension. We then provide an algorithm to efficiently sketch input data using such embeddings. The sketch size of the embedding used in the algorithm has a linear dependence on the number of sketching dimensions of the input. Assuming tensor contractions are performed with classical dense matrix multiplication algorithms, this algorithm achieves asymptotic cost within a factor of $O(\sqrt{m})$ of our cost lower bound, where $m$ is the sketch size. Further, when each tensor in the input has a dimension that needs to be sketched, this algorithm yields the optimal sketching asymptotic cost. We apply our sketching analysis to inexact tensor decomposition optimization algorithms. We provide a sketching algorithm for CP decomposition that is asymptotically faster than existing work in multiple regimes, and show optimality of an existing algorithm for tensor train rounding.
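    The cost savings such embeddings exploit come from never materializing $S$ or $x$: for Kronecker-structured factors, $(S_1 \otimes S_2)(a \otimes b) = (S_1 a) \otimes (S_2 b)$, so the sketch factorizes along the tensor network. A small numerical check of this identity with Gaussian factors (the paper's general construction and cost analysis go well beyond this):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 4, 50
    S1 = rng.normal(size=(m, n)) / np.sqrt(m)   # Gaussian sketching factors
    S2 = rng.normal(size=(m, n)) / np.sqrt(m)
    a, b = rng.normal(size=n), rng.normal(size=n)

    # Naive: materialize the (m^2 x n^2) embedding and the length-n^2 input.
    naive = np.kron(S1, S2) @ np.kron(a, b)
    # Structured: sketch each factor separately -- O(mn) work per factor.
    fast = np.kron(S1 @ a, S2 @ b)
    assert np.allclose(naive, fast)
    ```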
    Reliably-stabilizing piecewise-affine neural network controllers. (arXiv:2111.07183v3 [eess.SY] UPDATED)
    A common problem affecting neural network (NN) approximations of model predictive control (MPC) policies is the lack of analytical tools to assess the stability of the closed-loop system under the action of the NN-based controller. We present a general procedure to quantify the performance of such a controller, or to design minimum complexity NNs with rectified linear units (ReLUs) that preserve the desirable properties of a given MPC scheme. By quantifying the approximation error between NN-based and MPC-based state-to-input mappings, we first establish suitable conditions involving two key quantities, the worst-case error and the Lipschitz constant, guaranteeing the stability of the closed-loop system. We then develop an offline, mixed-integer optimization-based method to compute those quantities exactly. Together these techniques provide conditions sufficient to certify the stability and performance of a ReLU-based approximation of an MPC control law.
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v3 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces \textit{The Neural Testbed}: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    Deep Generative Modeling for Volume Reconstruction in Cryo-Electron Microscopy. (arXiv:2201.02867v3 [eess.IV] UPDATED)
    Recent breakthroughs in high-resolution imaging of biomolecules in solution with cryo-electron microscopy (cryo-EM) have unlocked new doors for the reconstruction of molecular volumes, thereby promising further advances in biology, chemistry, and pharmacological research. Recent next-generation volume reconstruction algorithms that combine generative modeling with end-to-end unsupervised deep learning techniques have shown promising preliminary results, but still face considerable technical and theoretical hurdles when applied to experimental cryo-EM images. In light of the proliferation of such methods, we propose here a critical review of recent advances in the field of deep generative modeling for cryo-EM volume reconstruction. The present review aims to (i) unify and compare these new methods using a consistent statistical framework, (ii) present them using a terminology familiar to machine learning researchers and computational biologists with no specific background in cryo-EM, and (iii) provide the necessary perspective on current advances to highlight their relative strengths and weaknesses, along with outstanding bottlenecks and avenues for improvements in the field. This review might also raise the interest of computer vision practitioners, as it highlights significant limits of deep generative models in low signal-to-noise regimes -- therefore emphasizing a need for new theoretical and methodological developments.
    Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions. (arXiv:2205.13056v1 [stat.ML])
    Due to the drastic gap in complexity between sequential and batch statistical learning, recent work has studied a smoothed sequential learning setting, where Nature is constrained to select contexts with density bounded by $1/\sigma$ with respect to a known measure $\mu$. Unfortunately, for some function classes, there is an exponential gap between the statistically optimal regret and that which can be achieved efficiently. In this paper, we give a computationally efficient algorithm that is the first to enjoy the statistically optimal $\log(T/\sigma)$ regret for realizable $K$-wise linear classification. We extend our results to settings where the true classifier is linear in an over-parameterized polynomial featurization of the contexts, as well as to a realizable piecewise-regression setting assuming access to an appropriate ERM oracle. Somewhat surprisingly, standard disagreement-based analyses are insufficient to achieve regret logarithmic in $1/\sigma$. Instead, we develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers. Along the way, we develop numerous technical tools of independent interest, including a general anti-concentration bound for the determinant of certain matrix averages.
    Matryoshka Representations for Adaptive Deployment. (arXiv:2205.13147v1 [cs.LG])
    Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context rigid, fixed capacity representations can be either over or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL) which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility within the learned Matryoshka Representations offer: (a) up to 14x smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet), vision + language (ALIGN) and language (BERT). MRL code and pretrained models are open-sourced at https://github.com/RAIVNLab/MRL.
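    The coarse-to-fine behaviour comes from supervising nested prefixes of a single embedding during training. A minimal sketch of that signal follows; the granularities, head sharing, and loss weighting here are simplifying assumptions, and the open-sourced repository linked above is the authoritative implementation.

    ```python
    import torch
    import torch.nn as nn

    class MatryoshkaHead(nn.Module):
        """Classify from nested prefixes of one embedding so that every
        prefix length is a usable representation on its own."""
        def __init__(self, dim=512, n_classes=1000,
                     grains=(8, 16, 64, 256, 512)):
            super().__init__()
            self.grains = grains
            self.heads = nn.ModuleList(nn.Linear(m, n_classes) for m in grains)

        def loss(self, z, y):   # z: (batch, dim) embedding, y: labels
            ce = nn.functional.cross_entropy
            return sum(ce(head(z[:, :m]), y)
                       for m, head in zip(self.grains, self.heads))

    # Training: z = backbone(images); total_loss = MatryoshkaHead().loss(z, y).
    # At deployment, truncate z to the largest prefix the budget allows.
    ```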
    Mitigating Memorization of Noisy Labels via Regularization between Representations. (arXiv:2110.09022v3 [cs.LG] UPDATED)
    Designing robust loss functions is popular in learning with noisy labels, but existing designs do not explicitly consider the overfitting property of deep neural networks (DNNs). As a result, applying these losses may still suffer from overfitting to (memorizing) noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, it is non-trivial to design a neural network with the best capacity for an arbitrary task. To circumvent this dilemma, instead of changing the model architecture, we decouple DNNs into an encoder followed by a linear classifier and propose to restrict the function space of a DNN by a representation regularizer. Particularly, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims. Code is available at github.com/UCSC-REAL/SelfSup_NoisyLabel.
    Joint Synthesis of Safety Certificate and Safe Control Policy using Constrained Reinforcement Learning. (arXiv:2111.07695v3 [cs.LG] UPDATED)
    Safety is the major consideration in controlling complex dynamical systems using reinforcement learning (RL), where a safety certificate can provide a provable safety guarantee. A valid safety certificate is an energy function indicating that safe states have low energy, together with a corresponding safe control policy that allows the energy function to always dissipate. The safety certificate and the safe control policy are closely related to each other, and both are challenging to synthesize. Therefore, existing learning-based studies treat one of them as prior knowledge to learn the other, which limits their applicability to general unknown dynamics. This paper proposes a novel approach that simultaneously synthesizes the energy-function-based safety certificate and learns the safe control policy with constrained reinforcement learning (CRL). We do not rely on prior knowledge of either a model-based controller or a perfect safety certificate. In particular, we formulate a loss function to optimize the safety certificate parameters by minimizing the occurrence of energy increases. By adding this optimization procedure as an outer loop to Lagrangian-based CRL, we jointly update the policy and safety certificate parameters and prove that they converge to their respective local optima: the optimal safe policy and a valid safety certificate. We evaluate our algorithms on multiple safety-critical benchmark environments. The results show that the proposed algorithm learns provably safe policies with no constraint violations. The validity, or feasibility, of the synthesized safety certificate is also verified numerically.
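    The certificate-learning objective, penalizing occurrences of energy increase along transitions visited by the policy, admits a compact form. A hedged sketch follows; the network shape, margin, and sampling scheme are illustrative rather than the paper's exact formulation.

    ```python
    import torch
    import torch.nn as nn

    # Learned energy function over states (dimensions are placeholders).
    energy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

    def certificate_loss(s, s_next, margin=0.1):
        """Penalize transitions where the energy fails to dissipate, i.e.,
        E(s') - E(s) > -margin. One reading of "minimizing the occurrence
        of energy increases", not the paper's exact objective."""
        increase = energy(s_next) - energy(s)
        return torch.relu(increase + margin).mean()

    # Outer loop: update `energy` on (s, s') batches from the current policy,
    # while the inner Lagrangian-based CRL step updates the policy itself.
    ```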
    GraphPMU: Event Clustering via Graph Representation Learning Using Locationally-Scarce Distribution-Level Fundamental and Harmonic PMU Measurements. (arXiv:2205.13116v1 [cs.LG])
    This paper is concerned with the complex task of identifying the type and cause of the events that are captured by distribution-level phasor measurement units (D-PMUs) in order to enhance situational awareness in power distribution systems. Our goal is to address two fundamental challenges in this field: a) scarcity of measurement locations due to the high cost of purchasing, installing, and streaming data from D-PMUs; b) limited prior knowledge about event signatures due to the fact that events are diverse, infrequent, and inherently unscheduled. To tackle these challenges, we propose an unsupervised graph-representation learning method, called GraphPMU, to significantly improve the performance of event clustering under locationally scarce data availability, based on the following two new directions: 1) using topological information about the relative location of the few available phasor measurement units on the graph of the power distribution network; 2) utilizing not only the commonly used fundamental phasor measurements, but also the less explored harmonic phasor measurements in the process of analyzing the signatures of various events. Through a detailed analysis of several case studies, we show that GraphPMU can greatly outperform the prevalent methods in the literature.
    Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments. (arXiv:2205.13044v1 [cs.LG])
    We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amount of changes of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
    Understanding Metrics for Paraphrasing. (arXiv:2205.13119v1 [cs.CL])
    Paraphrase generation is a difficult problem. This is not only because of limitations in text generation capabilities but also due to the lack of a proper definition of what qualifies as a paraphrase and of corresponding metrics to measure how good it is. Metrics for the evaluation of paraphrase quality are an ongoing research problem. Most of the existing metrics in use, having been borrowed from other tasks, do not capture the complete essence of a good paraphrase and often fail at borderline cases. In this work, we propose a novel metric $ROUGE_P$ to measure the quality of paraphrases along the dimensions of adequacy, novelty and fluency. We also provide empirical evidence to show that the current natural language generation metrics are insufficient to measure these desired properties of a good paraphrase. We look at paraphrase model fine-tuning and generation from the lens of metrics to gain a deeper understanding of what it takes to generate and evaluate a good paraphrase.
    A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens. (arXiv:2107.07875v2 [stat.ML] UPDATED)
    A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
    Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model. (arXiv:2205.13085v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.
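    Under the stated model $Y = m(X) + \varepsilon\sigma(X)$, the exogenous errors can be recovered by two-stage regression and standardization, $\hat\varepsilon = (Y - \hat m(X))/\hat\sigma(X)$. The sketch below shows only this extraction identity with generic off-the-shelf regressors; GRCI itself is more involved than this.

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def extract_errors(X, y):
        """Standardized residuals under the heteroscedastic noise model."""
        m_hat = GradientBoostingRegressor().fit(X, y)       # conditional mean
        resid = y - m_hat.predict(X)
        s_hat = GradientBoostingRegressor().fit(X, np.abs(resid))  # MAD model
        sigma = np.clip(s_hat.predict(X), 1e-6, None)
        return resid / sigma      # estimates of the exogenous error terms
    ```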
    Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (arXiv:2205.13098v1 [cs.LG])
    The computation of the Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. Approximating the Wasserstein gradient with finite samples requires solving a variational problem. We study this variational problem in the family of two-layer networks with squared-ReLU activations, for which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.
    Cali3F: Calibrated Fast Fair Federated Recommendation System. (arXiv:2205.13121v1 [cs.IR])
    The increasingly stringent regulations on privacy protection have sparked interest in federated learning. As a distributed machine learning framework, it bridges isolated data islands by training a global model over devices while keeping data localized. Specific to recommendation systems, many federated recommendation algorithms have been proposed to realize privacy-preserving collaborative recommendation. However, several constraints remain largely unexplored. One big concern is how to ensure fairness between participants in federated learning, that is, to maintain the uniformity of recommendation performance across devices. On the other hand, due to data heterogeneity and limited networks, additional challenges occur in the convergence speed. To address these problems, in this paper, we first propose a personalized federated recommendation system training algorithm to improve the fairness of recommendation performance. We then adopt a clustering-based aggregation method to accelerate the training process. Combining the two components, we propose Cali3F, a calibrated, fast and fair federated recommendation framework. Cali3F not only addresses the convergence problem with a within-cluster parameter-sharing approach but also significantly boosts fairness by calibrating local models with the global model. We demonstrate the performance of Cali3F across standard benchmark datasets and explore its efficacy in comparison to traditional aggregation approaches.
    QGNN: Value Function Factorisation with Graph Neural Networks. (arXiv:2205.13005v1 [cs.LG])
    In multi-agent reinforcement learning, the use of a global objective is a powerful tool for incentivising cooperation. Unfortunately, it is not sample-efficient to train individual agents with a global reward, because it does not necessarily correlate with an agent's individual actions. This problem can be solved by factorising the global value function into local value functions. Early work in this domain performed factorisation by conditioning local value functions purely on local information. Recently, it has been shown that providing both local information and an encoding of the global state can promote cooperative behaviour. In this paper we propose QGNN, the first value factorisation method to use a graph neural network (GNN) based model. The multi-layer message passing architecture of QGNN provides more representational complexity than models in prior work, allowing it to produce a more effective factorisation. QGNN also introduces a permutation invariant mixer which is able to match the performance of other methods, even with significantly fewer parameters. We evaluate our method against several baselines, including QMIX-Att, GraphMIX, QMIX, VDN, and hybrid architectures. Our experiments include Starcraft, the standard benchmark for credit assignment; Estimate Game, a custom environment that explicitly models inter-agent dependencies; and Coalition Structure Generation, a foundational problem with real-world applications. The results show that QGNN outperforms state-of-the-art value factorisation baselines consistently.
    Urban Rhapsody: Large-scale exploration of urban soundscapes. (arXiv:2205.13064v1 [cs.CY])
    Noise is one of the primary quality-of-life issues in urban environments. In addition to annoyance, noise negatively impacts public health and educational performance. While low-cost sensors can be deployed to monitor ambient noise levels at high temporal resolutions, the amount of data they produce and the complexity of these data pose significant analytical challenges. One way to address these challenges is through machine listening techniques, which are used to extract features in an attempt to classify the source of noise and understand temporal patterns of a city's noise situation. However, the overwhelming number of noise sources in the urban environment and the scarcity of labeled data make it nearly impossible to create classification models with vocabularies large enough to capture the true dynamism of urban soundscapes. In this paper, we first identify a set of requirements in the as-yet-unexplored domain of urban soundscape exploration. To satisfy these requirements and tackle the identified challenges, we propose Urban Rhapsody, a framework that combines state-of-the-art audio representation, machine learning, and visual analytics to allow users to interactively create classification models, understand the noise patterns of a city, and quickly retrieve and label audio excerpts in order to create a large, high-precision annotated database of urban sound recordings. We demonstrate the tool's utility through case studies performed by domain experts using data generated over the five-year deployment of a one-of-a-kind sensor network in New York City.
    Semi-supervised Drifted Stream Learning with Short Lookback. (arXiv:2205.13066v1 [cs.LG])
    In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available in the beginning; 3) real-world data are not always i.i.d. and drift gradually over time; 4) the storage of historical streams is limited, so model updating can only be achieved based on a very short lookback window. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under such a setting as a semi-supervised drifted stream learning with short lookback problem (SDSL). SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning, continual learning, and domain adaptation: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. The framework is able to accomplish: 1) robust pseudo-labeling in the generation step; 2) anti-forgetting adaptation in the replay step. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model that leverages supervised knowledge of previously labeled data, unsupervised knowledge of new data, and structural knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat-region search problem. We propose a novel minimax game-based replay objective function to solve the flat-region search problem and develop an effective optimization solver. Finally, we present extensive experiments to demonstrate that our framework can effectively address the task of anti-forgetting learning in drifted streams with short lookback.
    Factorized Structured Regression for Large-Scale Varying Coefficient Models. (arXiv:2205.13080v1 [stat.ML])
    Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.
    Transferable Adversarial Attack based on Integrated Gradients. (arXiv:2205.13152v1 [cs.LG])
    The vulnerability of deep neural networks to adversarial examples has drawn tremendous attention from the community. Three approaches, optimizing standard objective functions, exploiting attention maps, and smoothing decision surfaces, are commonly used to craft adversarial examples. By tightly integrating the three approaches, we propose a new and simple algorithm named Transferable Attack based on Integrated Gradients (TAIG) in this paper, which can find highly transferable adversarial examples for black-box attacks. Unlike previous methods that use multiple computational terms or combine with other methods, TAIG integrates the three approaches into one single term. Two versions of TAIG that compute their integrated gradients on a straight-line path and a random piecewise linear path are studied. Both versions offer strong transferability and can seamlessly work together with the previous methods. Experimental results demonstrate that TAIG outperforms the state-of-the-art methods. The code will be available at https://github.com/yihuang2016/TAIG
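    Integrated gradients on the straight-line path, which the first TAIG variant builds on, is straightforward to write down. A hedged sketch that uses its sign as an adversarial step direction follows; TAIG's precise update rule and the random piecewise-linear-path variant are not reproduced here.

    ```python
    import torch

    def integrated_gradients(model, x, label, baseline=None, steps=32):
        """IG along the straight line from `baseline` to `x`; `model` is
        assumed to return logits for a batch of one image."""
        baseline = torch.zeros_like(x) if baseline is None else baseline
        grads = torch.zeros_like(x)
        for k in range(1, steps + 1):
            xk = (baseline + k / steps * (x - baseline)).requires_grad_(True)
            model(xk)[0, label].backward()
            grads += xk.grad
        return (x - baseline) * grads / steps

    def ig_attack_step(model, x, label, alpha=1/255):
        ig = integrated_gradients(model, x, label)
        return (x - alpha * ig.sign()).clamp(0, 1)   # step against the true class
    ```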
    Learning to Query Internet Text for Informing Reinforcement Learning Agents. (arXiv:2205.13079v1 [cs.LG])
    Generalization to out-of-distribution tasks in reinforcement learning is a challenging problem. One successful approach improves generalization by conditioning policies on task or environment descriptions that provide information about the current transition or reward functions. Previously, these descriptions were often expressed as generated or crowdsourced text. In this work, we begin to tackle the problem of extracting useful information from natural language found in the wild (e.g. internet forums, documentation, and wikis). These natural, pre-existing sources are especially noisy and large, and they present novel challenges compared to previous approaches. We propose to address these challenges by training reinforcement learning agents to learn to query these sources as a human would, and we experiment with how and when an agent should query. To address the \textit{how}, we demonstrate that pretrained QA models perform well at executing zero-shot queries in our target domain. Using information retrieved by a QA model, we train an agent to learn \textit{when} it should execute queries. We show that our method correctly learns to execute queries to maximize reward in a reinforcement learning setting.
    TSEM: Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series. (arXiv:2205.13012v1 [cs.LG])
    Deep learning has become a one-size-fits-all solution for technical and business domains thanks to its flexibility and adaptability. It is implemented using opaque models, which unfortunately undermines the trustworthiness of the outcomes. In order to better understand the behavior of a system, particularly one driven by time series, looking inside a deep learning model via so-called post-hoc eXplainable Artificial Intelligence (XAI) approaches is important. There are two major types of XAI for time series data, namely model-agnostic and model-specific. The model-specific approach is considered in this work. While other approaches employ either Class Activation Mapping (CAM) or an Attention Mechanism, we merge the two strategies into a single system, simply called the Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series (TSEM). TSEM combines the capabilities of RNN and CNN models in such a way that RNN hidden units are employed as attention weights along the temporal axis of the CNN feature maps. The results show that TSEM outperforms XCM and is similar to STAM in terms of accuracy, while also satisfying a number of interpretability criteria, including causality, fidelity, and spatiotemporality.
    Preference Dynamics Under Personalized Recommendations. (arXiv:2205.13026v1 [cs.LG])
    Many projects (both practical and academic) have designed algorithms to match users to content they will enjoy under the assumption that users' preferences and opinions do not change with the content they see. Evidence suggests that individuals' preferences are directly shaped by what content they see -- radicalization, rabbit holes, polarization, and boredom are all example phenomena of preferences affected by content. Polarization in particular can occur even in ecosystems with "mass media," where no personalization takes place, as recently explored in a natural model of preference dynamics by~\citet{hkazla2019geometric} and~\citet{gaitonde2021polarization}. If all users' preferences are drawn towards content they already like, or are repelled from content they already dislike, uniform consumption of media leads to a population of heterogeneous preferences converging towards only two poles. In this work, we explore whether some phenomenon akin to polarization occurs when users receive \emph{personalized} content recommendations. We use a similar model of preference dynamics, where an individual's preferences move towards content they consume and enjoy, and away from content they consume and dislike. We show that standard user reward maximization is an almost trivial goal in such an environment (a large class of simple algorithms will achieve only constant regret). A more interesting objective, then, is to understand under what conditions a recommendation algorithm can ensure stationarity of users' preferences. We show how to design content recommendations that can achieve approximate stationarity, under mild conditions on the set of available content, when a user's preferences are known, and how one can learn enough about a user's preferences to implement such a strategy even when user preferences are initially unknown.
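    The preference dynamics described above are straightforward to simulate. The sketch below assumes a simple, hypothetical update rule (move along the content direction when affinity is positive, against it when negative, then renormalize); the exact models in the cited papers differ in detail.
    ```python
    import numpy as np

    def preference_step(u, c, eta=0.05):
        """One hypothetical update: the user's preference vector u drifts
        toward content c they enjoy (u @ c > 0) and away from content
        they dislike, then is renormalized to the unit sphere."""
        u = u + eta * np.sign(u @ c) * c
        return u / np.linalg.norm(u)

    rng = np.random.default_rng(0)
    u = rng.normal(size=3)
    u /= np.linalg.norm(u)
    for _ in range(1000):  # uniform (non-personalized) content stream
        u = preference_step(u, rng.normal(size=3))
    ```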
    Trainable Weight Averaging for Fast Convergence and Better Generalization. (arXiv:2205.13104v1 [cs.LG])
    Stochastic gradient descent (SGD) and its variants are commonly considered the de facto methods for training deep neural networks (DNNs). While recent improvements to SGD mainly focus on the descent algorithm itself, few works pay attention to utilizing the historical solutions -- as an iterative method, SGD has actually gone through substantial exploration before its final convergence. A recent and interesting attempt is stochastic weight averaging (SWA), which significantly improves generalization by simply averaging the solutions at the tail stage of training. In this paper, we propose to optimize the averaging coefficients, leading to our Trainable Weight Averaging (TWA), essentially a novel training method in a reduced subspace spanned by historical solutions. TWA is quite efficient and has good generalization capability as the degree of freedom for training is small. It largely reduces the estimation error of SWA, making it not only further improve the SWA solutions but also take full advantage of the solutions generated in the head stage of training, where SWA fails. In extensive numerical experiments, (i) TWA achieves consistent improvements over SWA with less sensitivity to the learning rate; (ii) applying TWA in the head stage of training largely speeds up convergence, resulting in over 40% time saving on CIFAR and 30% on ImageNet with improved generalization compared with regular training. The code is released at https://github.com/nblt/TWA.
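    A minimal sketch of the idea, assuming flattened checkpoint vectors and a differentiable loss over a parameter vector; softmax normalization of the coefficients is one plausible parameterization, not necessarily the paper's exact scheme.
    ```python
    import torch

    def twa_average(checkpoints, loss_fn, steps=200, lr=0.1):
        """Learn averaging coefficients over K historical solutions by
        minimizing the training loss of the averaged parameter vector
        (training happens in the K-dimensional subspace they span)."""
        thetas = torch.stack(checkpoints)      # (K, D) flattened weights
        logits = torch.zeros(len(checkpoints), requires_grad=True)
        opt = torch.optim.Adam([logits], lr=lr)
        for _ in range(steps):
            w = torch.softmax(logits, dim=0)   # coefficients sum to one
            loss = loss_fn((w[:, None] * thetas).sum(dim=0))
            opt.zero_grad(); loss.backward(); opt.step()
        w = torch.softmax(logits, dim=0)
        return (w[:, None] * thetas).sum(dim=0).detach()
    ```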
    Designing an Efficient End-to-end Machine Learning Pipeline for Real-time Empty-shelf Detection. (arXiv:2205.13060v1 [cs.LG])
    On-Shelf Availability (OSA) of products in retail stores is a critical business criterion in the fast-moving consumer goods and retail sector. When a product is out-of-stock (OOS) and a customer cannot find it on its designated shelf, this negatively impacts the customer's behavior and future demand. Several methods are adopted by retailers today to detect empty shelves and ensure high OSA of products; however, such methods are generally ineffective and infeasible since they are either manual, expensive or less accurate. Recently, machine learning based solutions have been proposed, but they suffer from high computation cost and low accuracy due to the lack of large annotated datasets of on-shelf products. Here, we present an elegant approach for designing an end-to-end machine learning (ML) pipeline for real-time empty-shelf detection. Considering the strong dependency between the quality of ML models and the quality of data, we focus on the importance of proper data collection, cleaning and correct data annotation before delving into modeling. Since an empty-shelf detection solution should be computationally efficient for real-time predictions, we explore different run-time optimizations to improve the model performance. Our dataset contains 1000 images, collected and annotated by following well-defined guidelines. Our low-latency model achieves a mean average F1-score of 68.5%, and can process up to 67 images/s on Intel Xeon Gold and up to 860 images/s on an A100 GPU. Our annotated dataset is publicly available along with our optimized models.
    BRIGHT -- Graph Neural Networks in Real-Time Fraud Detection. (arXiv:2205.13084v1 [cs.LG])
    Detecting fraudulent transactions is an essential component to control risk in e-commerce marketplaces. Apart from rule-based and machine learning filters that are already deployed in production, we want to enable efficient real-time inference with graph neural networks (GNNs), which is useful to catch multihop risk propagation in a transaction graph. However, two challenges arise in the implementation of GNNs in production. First, future information in a dynamic graph should not be considered in message passing to predict the past. Second, the latency of graph query and GNN model inference is usually up to hundreds of milliseconds, which is costly for some critical online services. To tackle these challenges, we propose a Batch and Real-time Inception GrapH Topology (BRIGHT) framework to conduct an end-to-end GNN learning that allows efficient online real-time inference. The BRIGHT framework consists of a graph transformation module (Two-Stage Directed Graph) and a corresponding GNN architecture (Lambda Neural Network). The Two-Stage Directed Graph guarantees that the information passed through neighbors is only from the historical payment transactions. It consists of two subgraphs representing historical relationships and real-time links, respectively. The Lambda Neural Network decouples inference into two stages: batch inference of entity embeddings and real-time inference of transaction prediction. Our experiments show that BRIGHT outperforms the baseline models by >2\% on average w.r.t.~precision. Furthermore, BRIGHT is computationally efficient for real-time fraud detection. Regarding end-to-end performance (including neighbor query and inference), BRIGHT can reduce the P99 latency by >75\%. For the inference stage, our speedup is on average 7.8$\times$ compared to the traditional GNN.
    EvoVGM: A Deep Variational Generative Model for Evolutionary Parameter Estimation. (arXiv:2205.13034v1 [cs.LG])
    Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as is done within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69 and GTR. We train the model via a low-variance variational objective function and a gradient ascent algorithm. Here, we show the consistency and effectiveness of the method on synthetic sequence alignments simulated with several evolutionary scenarios and on a real virus sequence alignment.
    Green Hierarchical Vision Transformer for Masked Image Modeling. (arXiv:2205.13515v1 [cs.CV])
    We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of self-attention w.r.t. the number of patches, group attention encourages a uniform partition in which the visible patches within each local window of arbitrary size are grouped into equally sized groups; masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. As a result, MIM can now work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
    BiT: Robustly Binarized Multi-distilled Transformer. (arXiv:2205.13016v1 [cs.LG])
    Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues; however, it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow, for the first time, fully binarized transformer models at a practical level of accuracy, coming within as little as 5.9% of a full-precision BERT baseline on the GLUE language understanding benchmark.
    People counting system for retail analytics using edge AI. (arXiv:2205.13020v1 [cs.LG])
    Developments in IoT applications are playing an important role in our day-to-day life, from business predictions to self-driving cars. One of the areas most influenced by AI and IoT is retail analytics. In retail analytics, the conversion rate is the metric most often used by retail stores to relate how many people have visited the store to how many purchases have happened. The conversion rate assesses the effect of marketing operations, stocking decisions, store layout and promotions, etc. Our project builds a cost-effective people counting system with AI at the edge, which calculates conversion rates using the total number of people counted by the system and the number of transactions for the day, providing analytical insights for retail store optimization with very minimal hardware requirements.
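    For concreteness, the metric driving the system is a one-line computation (names here are illustrative):
    ```python
    def conversion_rate(num_transactions: int, footfall: int) -> float:
        """Daily conversion rate: transactions divided by the number of
        visitors counted by the edge AI people counter."""
        return num_transactions / footfall if footfall else 0.0

    # e.g. 180 receipts against 1200 counted visitors -> 15% conversion
    assert round(conversion_rate(180, 1200), 2) == 0.15
    ```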
    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification. (arXiv:2205.13094v1 [cs.LG])
    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.
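    The undersampling baseline analyzed here is simple to state in code; the sketch below assumes group labels are available and subsamples every group down to the minority-group size.
    ```python
    import numpy as np

    def undersample(X, y, group, seed=0):
        """Drop excess majority-group samples so that every group
        contributes the same number of points as the smallest group."""
        rng = np.random.default_rng(seed)
        groups, counts = np.unique(group, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.flatnonzero(group == g), size=n_min, replace=False)
            for g in groups])
        return X[keep], y[keep]
    ```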
    Improving Subgraph Representation Learning via Multi-View Augmentation. (arXiv:2205.13038v1 [cs.LG])
    Subgraph representation learning based on Graph Neural Network (GNN) has broad applications in chemistry and biology, such as molecule property prediction and gene collaborative function prediction. On the other hand, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks but are rarely explored in the GNN-based subgraph representation learning literature. In this work, we developed a novel multi-view augmentation mechanism to improve subgraph representation learning and thus the accuracy of downstream prediction tasks. The augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve high training efficiency, scalability, and improved accuracy. Experiments on several real-world subgraph benchmarks demonstrate the superiority of our proposed multi-view augmentation techniques.
    Online Deep Equilibrium Learning for Regularization by Denoising. (arXiv:2205.13051v1 [eess.IV])
    Plug-and-Play Priors (PnP) and Regularization by Denoising (RED) are widely-used frameworks for solving imaging inverse problems by computing fixed-points of operators combining physical measurement models and learned image priors. While traditional PnP/RED formulations have focused on priors specified using image denoisers, there is a growing interest in learning PnP/RED priors that are end-to-end optimal. The recent Deep Equilibrium Models (DEQ) framework has enabled memory-efficient end-to-end learning of PnP/RED priors by implicitly differentiating through the fixed-point equations without storing intermediate activation values. However, the dependence of the computational/memory complexity of the measurement models in PnP/RED on the total number of measurements leaves DEQ impractical for many imaging applications. We propose ODER as a new strategy for improving the efficiency of DEQ through stochastic approximations of the measurement models. We theoretically analyze ODER giving insights into its convergence and ability to approximate the traditional DEQ approach. Our numerical results suggest the potential improvements in training/testing complexity due to ODER on three distinct imaging applications.
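    For orientation, the classical RED fixed-point iteration that DEQ differentiates through can be sketched as follows, with a generic linear forward operator A and an arbitrary plug-in denoiser D; ODER's contribution, roughly, is to replace the full data-fidelity gradient with stochastic subsets of the measurements.
    ```python
    import numpy as np

    def red_fixed_point(A, y, denoiser, tau=0.1, gamma=0.5, iters=100):
        """Fixed-point iteration for Regularization by Denoising:
        x <- x - gamma * (A^T (A x - y) + tau * (x - D(x))).
        gamma must be small enough for the iteration to contract."""
        x = A.T @ y  # simple back-projection initialization
        for _ in range(iters):
            grad = A.T @ (A @ x - y) + tau * (x - denoiser(x))
            x = x - gamma * grad
        return x
    ```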
    Scalable and Low-Latency Federated Learning with Cooperative Mobile Edge Networking. (arXiv:2205.13054v1 [cs.DC])
    Federated learning (FL) enables collaborative model training without centralizing data. However, the traditional FL framework is cloud-based and suffers from high communication latency. On the other hand, the edge-based FL framework that relies on an edge server co-located with an access point has low communication latency but suffers from degraded model accuracy due to the limited coverage of the edge server. In light of high-accuracy but high-latency cloud-based FL and low-latency but low-accuracy edge-based FL, this paper proposes a new FL framework based on cooperative mobile edge networking called cooperative federated edge learning (CFEL) to enable both high-accuracy and low-latency distributed intelligence at mobile edge networks. Considering the unique two-tier network architecture of CFEL, a novel federated optimization method dubbed cooperative edge-based federated averaging (CE-FedAvg) is further developed, wherein each edge server both coordinates collaborative model training among the devices within its own coverage and cooperates with other edge servers to learn a shared global model through decentralized consensus. Experimental results based on benchmark datasets show that CFEL can significantly speed up convergence and reduce the training time needed to achieve a target model accuracy compared with prior FL frameworks.
    Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality. (arXiv:2205.13521v1 [cs.AI])
    Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
    Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation. (arXiv:2103.11802v4 [cs.LG] UPDATED)
    In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
    Comparison of Traditional and Hybrid Time Series Models for Forecasting COVID-19 Cases. (arXiv:2105.03266v2 [cs.SI] UPDATED)
    Time series forecasting methods play a critical role in estimating the spread of an epidemic. The coronavirus outbreak of December 2019 has already infected millions all over the world and continues to spread. Just when the curve of the outbreak had started to flatten, many countries again started to witness a rise in cases, now referred to as the 2nd wave of the pandemic. A thorough analysis of time-series forecasting models is therefore required to equip state authorities and health officials with immediate strategies for future times. The aims of this study are three-fold: (a) To model the overall trend of the spread; (b) To generate a short-term forecast of 10 days in countries with the highest incidence of confirmed cases (USA, India and Brazil); (c) To quantitatively determine the algorithm that is best suited for precise modelling of the linear and non-linear features of the time series. The comparison of forecasting models for the total cumulative cases of each country is carried out by comparing the reported data and the predicted value, and then ranking the algorithms (Prophet, Holt-Winters, LSTM, ARIMA, and ARIMA-NARNN) based on their RMSE, MAE and MAPE values. The hybrid combination of ARIMA and NARNN (Nonlinear Auto-Regression Neural Network) gave the best result among the selected models, with a reduced RMSE that proved to be almost 35.3% better than that of one of the most prevalent methods of time-series prediction (ARIMA). The results demonstrate the efficacy of the hybrid ARIMA-NARNN model over other forecasting methods such as Prophet, Holt-Winters, LSTM, and the ARIMA model in encapsulating the linear as well as non-linear patterns of the epidemic datasets.
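    A much-simplified one-step sketch of the hybrid idea (ARIMA for the linear component, a neural network on its residuals) is given below; the paper's NARNN and its multi-step protocol differ in detail, and the library calls assume statsmodels and scikit-learn.
    ```python
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA
    from sklearn.neural_network import MLPRegressor

    def hybrid_one_step_forecast(series, order=(2, 1, 2), lags=7):
        """ARIMA captures the linear structure; an MLP models the leftover
        residuals from lagged residual windows. The forecast is the sum."""
        arima = ARIMA(series, order=order).fit()
        resid = np.asarray(arima.resid)
        X = np.array([resid[i:i + lags] for i in range(len(resid) - lags)])
        mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0).fit(X, resid[lags:])
        return (np.asarray(arima.forecast(1))[0]
                + mlp.predict(resid[-lags:].reshape(1, -1))[0])
    ```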
    Kernel Ridgeless Regression is Inconsistent for Low Dimensions. (arXiv:2205.13525v1 [cs.LG])
    We show that kernel interpolation for a large class of shift-invariant kernels is inconsistent in fixed dimension, even with bandwidth adaptive to the training set.
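    Concretely, the interpolant in question is the minimum-norm kernel predictor obtained by solving the kernel system exactly, with no ridge term; a sketch with a Gaussian (shift-invariant) kernel:
    ```python
    import numpy as np

    def kernel_interpolant(X, y, bandwidth=1.0):
        """Ridgeless kernel regression: solve K alpha = y exactly and
        predict f(x) = sum_i alpha_i k(x, x_i)."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        alpha = np.linalg.solve(np.exp(-sq / (2 * bandwidth ** 2)), y)
        def f(x_new):
            k = np.exp(-((x_new[None, :] - X) ** 2).sum(-1) / (2 * bandwidth ** 2))
            return k @ alpha
        return f
    ```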
    Continual evaluation for lifelong learning: Identifying the stability gap. (arXiv:2205.13452v1 [cs.LG])
    Introducing a time dependency on the data generating distribution has proven to be difficult for gradient-based training of neural networks, as the greedy updates result in catastrophic forgetting of previous timesteps. Continual learning aims to overcome the greedy optimization to enable continuous accumulation of knowledge over time. The data stream is typically divided into locally stationary distributions, called tasks, allowing task-based evaluation on held-out data from the training tasks. Contemporary evaluation protocols and metrics in continual learning are task-based and quantify the trade-off between stability and plasticity only at task transitions. However, our empirical evidence suggests that significant, temporary forgetting can occur between task transitions, remaining unidentified in task-based evaluation. Therefore, we propose a framework for continual evaluation that establishes per-iteration evaluation, and we define a new set of metrics that enable identifying the worst-case performance of the learner over its lifetime. Performing continual evaluation, we empirically identify that replay suffers from a stability gap: upon learning a new task, there is a substantial but transient decrease in performance on past tasks. Further conceptual and empirical analysis suggests that not only replay-based, but also regularization-based continual learning methods are prone to the stability gap.
    Are Transformers Effective for Time Series Forecasting?. (arXiv:2205.13504v1 [cs.AI])
    Recently, there has been a surge of Transformer-based solutions for the time series forecasting (TSF) task, especially for the challenging long-term TSF problem. The Transformer architecture relies on self-attention mechanisms to effectively extract the semantic correlations between paired elements in a long sequence, which is permutation-invariant and anti-ordering to some extent. However, in time series modeling, we are to extract the temporal relations among an ordered set of continuous points. Consequently, whether Transformer-based techniques are the right solutions for long-term time series forecasting is an interesting problem to investigate, despite the performance improvements shown in these studies. In this work, we question the validity of Transformer-based TSF solutions. In their experiments, the compared (non-Transformer) baselines are mainly autoregressive forecasting solutions, which usually have a poor long-term prediction capability due to inevitable error accumulation effects. In contrast, we use an embarrassingly simple architecture named DLinear that conducts direct multi-step (DMS) forecasting for comparison. DLinear decomposes the time series into a trend and a remainder series and employs two one-layer linear networks to model these two series for the forecasting task. Surprisingly, it outperforms existing complex Transformer-based models in most cases by a large margin. Therefore, we conclude that the relatively higher long-term forecasting accuracy of Transformer-based TSF solutions shown in existing works has little to do with the temporal relation extraction capabilities of the Transformer architecture. Instead, it is mainly due to the non-autoregressive DMS forecasting strategy used in them. We hope this study also advocates revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future.
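    A sketch of the DLinear idea in PyTorch, assuming a univariate input window: a moving average extracts the trend, and two one-layer linear maps produce the direct multi-step forecast. Zero padding at the window boundaries is a simplification of the paper's decomposition.
    ```python
    import torch
    import torch.nn as nn

    class DLinear(nn.Module):
        """Decompose the input window into a moving-average trend and a
        remainder, then map each to the horizon with its own linear layer."""
        def __init__(self, input_len, horizon, kernel=25):
            super().__init__()
            self.pool = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2)
            self.trend = nn.Linear(input_len, horizon)
            self.remainder = nn.Linear(input_len, horizon)

        def forward(self, x):  # x: (batch, input_len)
            trend = self.pool(x.unsqueeze(1)).squeeze(1)[:, :x.size(1)]
            return self.trend(trend) + self.remainder(x - trend)
    ```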
    Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency. (arXiv:2205.13476v1 [cs.LG])
    Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.~(i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v1 [cs.LG])
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason is that most RPEs are placed in the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.
    Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback. (arXiv:2205.13451v1 [cs.LG])
    We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
    The Neuro-Symbolic Brain. (arXiv:2205.13440v1 [cs.NE])
    Neural networks promote a distributed representation with no clear place for symbols. Despite this, we propose that symbols are manufactured simply by training a sparse random noise as a self-sustaining attractor in a feedback spiking neural network. This way, we can generate many of what we shall call prime attractors, and the networks that support them are like registers holding a symbolic value, and we call them registers. Like symbols, prime attractors are atomic and devoid of any internal structure. Moreover, the winner-take-all mechanism naturally implemented by spiking neurons enables registers to recover a prime attractor within a noisy signal. Using this faculty, when considering two connected registers, an input one and an output one, it is possible to bind in one shot using a Hebbian rule the attractor active on the output to the attractor active on the input. Thus, whenever an attractor is active on the input, it induces its bound attractor on the output; even though the signal gets blurrier with more bindings, the winner-take-all filtering faculty can recover the bound prime attractor. However, the capacity is still limited. It is also possible to unbind in one shot, restoring the capacity taken by that binding. This mechanism serves as a basis for working memory, turning prime attractors into variables. Also, we use a random second-order network to amalgamate the prime attractors held by two registers to bind the prime attractor held by a third register to them in one shot, de facto implementing a hash table. Furthermore, we introduce the register switch box composed of registers to move the content of one register to another. Then, we use spiking neurons to build a toy symbolic computer based on the above. The techniques used suggest ways to design extrapolating, reusable, sample-efficient deep learning networks at the cost of structural priors.
    Principled Knowledge Extrapolation with GANs. (arXiv:2205.13444v1 [cs.LG])
    Humans can extrapolate well, generalizing daily knowledge to unseen scenarios and raising and answering counterfactual questions. To imitate this ability via generative models, previous works have extensively studied explicitly encoding Structural Causal Models (SCMs) into architectures of generator networks. This methodology, however, limits the flexibility of the generator, as it must be carefully crafted to follow the causal graph, and demands a ground-truth SCM with a strong ignorability assumption as a prior, which is nontrivial in many real scenarios. Thus, many current causal GAN methods fail to generate high fidelity counterfactual results as they cannot easily leverage state-of-the-art generative models. In this paper, we propose to study counterfactual synthesis from a new perspective of knowledge extrapolation, where a given knowledge dimension of the data distribution is extrapolated, but the remaining knowledge is kept indistinguishable from the original distribution. We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem, and a novel principal knowledge descent method can efficiently estimate the extrapolated distribution through the adversarial game. Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.
    A framework for overparameterized learning. (arXiv:2205.13507v1 [cs.LG])
    An explanation for the success of deep neural networks is a central question in theoretical machine learning. According to classical statistical learning, the overparameterized nature of such models should imply a failure to generalize. Many argue that good empirical performance is due to the implicit regularization of first order optimization methods. In particular, the Polyak-{\L}ojasiewicz condition leads to gradient descent finding a global optimum that is close to initialization. In this work, we propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and infinite data. We then perform an analysis from the perspective of the Polyak-{\L}ojasiewicz condition. We obtain theoretical results of independent interest, concerning gradient descent on a composition $(f \circ F): G \to \mathbb{R}$ of functions $F: G \to H$ and $f: H \to \mathbb{R}$ with $G, H$ being Hilbert spaces. Building on these results, we determine the properties that have to be satisfied by the components of the prototype problem for gradient descent to find a global optimum that is close to initialization. We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem. Finally, we lay out a number of directions for future research.
    A Rotated Hyperbolic Wrapped Normal Distribution for Hierarchical Representation Learning. (arXiv:2205.13371v1 [cs.LG])
    We present a rotated hyperbolic wrapped normal distribution (RoWN), a simple yet effective alteration of the hyperbolic wrapped normal distribution (HWN). The HWN expands the domain of probabilistic modeling from Euclidean to hyperbolic space, where a tree can be embedded with arbitrarily low distortion in theory. In this work, we analyze the geometric properties of the diagonal HWN, a standard choice of distribution in probabilistic modeling. The analysis shows that the diagonal HWN is a poor fit for data points at the same hierarchy level, which are separated by their angular distance while sharing the same norm in the Poincar\'e disk model. We then empirically verify the presence of these limitations of the HWN, and show how RoWN, the newly proposed distribution, can alleviate them on various hierarchical datasets, including a noisy synthetic binary tree, WordNet, and Atari 2600 Breakout.
    Multi-fidelity power flow solver. (arXiv:2205.13362v1 [cs.LG])
    We propose a multi-fidelity neural network (MFNN) tailored for rapid high-dimensional grid power flow simulations and contingency analysis with scarce high-fidelity contingency data. The proposed model comprises two networks -- the first one trained on DC approximation as low-fidelity data and coupled to a high-fidelity neural net trained on both low- and high-fidelity power flow data. Each network features a latent module which parametrizes the model by a discrete grid topology vector for generalization (e.g., $n$ power lines with $k$ disconnections or contingencies, if any), and the targeted high-fidelity output is a weighted sum of linear and nonlinear functions. We tested the model on 14- and 118-bus test cases and evaluated its performance based on the $n-k$ power flow prediction accuracy with respect to imbalanced contingency data and high-to-low-fidelity sample ratio. The results presented herein demonstrate MFNN's potential and its limits with up to two orders of magnitude faster and more accurate power flow solutions than DC approximation.
    Constrained Reinforcement Learning for Short Video Recommendation. (arXiv:2205.13248v1 [cs.LG])
    The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users provide complex and multi-faceted responses towards recommendations, including watch time and various types of interactions with videos. As a result, established recommendation algorithms that concern a single objective are not adequate to meet this new demand of optimizing comprehensive user experiences. In this paper, we formulate the problem of short video recommendation as a constrained Markov Decision Process (MDP), where platforms want to optimize the main goal of user watch time in the long term, with the constraint of accommodating the auxiliary responses of user interactions such as sharing/downloading videos. To solve the constrained MDP, we propose a two-stage reinforcement learning approach based on the actor-critic framework. At stage one, we learn individual policies to optimize each auxiliary response. At stage two, we learn a policy to (i) optimize the main response and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive simulations, we demonstrate the effectiveness of our approach over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our approach in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of watch time and interactions from video views. Our approach has been fully launched in the production system to optimize user experiences on the platform.
    Learning to Accelerate by the Methods of Step-size Planning. (arXiv:2204.01705v4 [cs.LG] UPDATED)
    Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods for accelerating gradient descent that have some distinctiveness from existing techniques. The new methods, which we call {\em step-size planning}, use the {\em update experience} to learn an improved way of updating the parameters. The methods organize the experience into $K$ steps away from each other to facilitate planning. From the past experience, our planning algorithm, Csawg, learns a step-size model, which is a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning over multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large-scale applications. We show that for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu, L$ are the strong convexity constant of the loss function $F$ and the Lipschitz constant of $F'$, which is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error below 500 gradient evaluations, while gradient descent takes about 10000 gradient evaluations to reach a $10^{-3}$ accuracy. We discuss the connection of step-size planning to planning in reinforcement learning, in particular, Dyna architectures. (This is a shorter abstract than in the paper because of length requirements.)
    DT+GNN: A Fully Explainable Graph Neural Network using Decision Trees. (arXiv:2205.13234v1 [cs.LG])
    We propose the fully explainable Decision Tree Graph Neural Network (DT+GNN) architecture. In contrast to existing black-box GNNs and post-hoc explanation methods, the reasoning of DT+GNN can be inspected at every step. To achieve this, we first construct a differentiable GNN layer, which uses a categorical state space for nodes and messages. This allows us to convert the trained MLPs in the GNN into decision trees. These trees are pruned using our newly proposed method to ensure they are small and easy to interpret. We can also use the decision trees to compute traditional explanations. We demonstrate on both real-world datasets and synthetic GNN explainability benchmarks that this architecture works as well as traditional GNNs. Furthermore, we leverage the explainability of DT+GNNs to find interesting insights into many of these datasets, with some surprising results. We also provide an interactive web tool to inspect DT+GNN's decision making.
    Friends to Help: Saving Federated Learning from Client Dropout. (arXiv:2205.13222v1 [cs.LG])
    Federated learning (FL) is an outstanding distributed machine learning framework due to its benefits on data privacy and communication efficiency. Since full client participation in many cases is infeasible due to constrained resources, partial participation FL algorithms have been investigated that proactively select/sample a subset of clients, aiming to achieve learning performance close to the full participation case. This paper studies a passive partial client participation scenario that is much less well understood, where partial participation is a result of external events, namely client dropout, rather than a decision of the FL algorithm. We cast FL with client dropout as a special case of a larger class of FL problems where clients can submit substitute (possibly inaccurate) local model updates. Based on our convergence analysis, we develop a new algorithm FL-FDMS that discovers friends of clients (i.e., clients whose data distributions are similar) on-the-fly and uses friends' local updates as substitutes for the dropout clients, thereby reducing the substitution error and improving the convergence performance. A complexity reduction mechanism is also incorporated into FL-FDMS, making it both theoretically sound and practically useful. Experiments on MNIST and CIFAR-10 confirmed the superior performance of FL-FDMS in handling client dropout in FL.
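    A toy sketch of the substitution step, assuming a precomputed client-similarity matrix; the actual FL-FDMS discovers friends on the fly from local updates and adds a complexity-reduction mechanism.
    ```python
    import numpy as np

    def aggregate_with_friends(updates, n_clients, similarity):
        """FedAvg-style aggregation where each dropped client is replaced
        by the update of its most similar active 'friend'.
        updates: {client_id: update vector} for clients that responded."""
        active = list(updates)
        full = []
        for i in range(n_clients):
            if i in updates:
                full.append(updates[i])
            else:
                friend = max(active, key=lambda j: similarity[i][j])
                full.append(updates[friend])
        return np.mean(full, axis=0)
    ```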
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v1 [cs.LG])
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over conventional CO solvers, as DRL-NCO is capable of learning CO solvers without supervised labels attained from a verified solver. This paper presents a novel training scheme, Sym-NCO, that achieves significant performance improvements over existing DRL-NCO methods. Sym-NCO is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Imposing symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO, as symmetricities are invariant features shared by certain CO tasks. Our experimental results verify that Sym-NCO greatly improves the performance of DRL-NCO methods in four CO tasks, including the traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without employing problem-specific techniques. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, iterative local search (ILS), on PCTSP while running 240 times faster.
    Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks. (arXiv:2205.13283v1 [cs.LG])
    Unraveling the general structure underlying the loss landscapes of deep neural networks (DNNs) is important for the theoretical study of deep learning. Inspired by the embedding principle of the DNN loss landscape, we prove in this work an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs. Specifically, we propose a critical lifting operator by which any critical point of a shallower network can be lifted to a critical manifold of the target network while preserving the outputs. Through lifting, a local minimum of an NN can become a strict saddle point of a deeper NN, which can be easily escaped by first-order methods. The embedding principle in depth reveals a large family of critical points at which layer linearization happens, i.e., the computation of certain layers is effectively linear for the training inputs. We empirically demonstrate that, by suppressing layer linearization, batch normalization helps avoid the lifted critical manifolds, resulting in a faster decay of loss. We also demonstrate that increasing the training data reduces the lifted critical manifolds and thus could accelerate training. Overall, the embedding principle in depth well complements the embedding principle (in width), resulting in a complete characterization of the hierarchical structure of critical points/manifolds of a DNN loss landscape.
    Aggregating Gradients in Encoded Domain for Federated Learning. (arXiv:2205.13216v1 [cs.CR])
    Malicious attackers and an honest-but-curious server can steal private client data from uploaded gradients in federated learning. Although current protection methods (e.g., additive homomorphic cryptosystem) can guarantee the security of the federated learning system, they bring additional computation and communication costs. To mitigate the cost, we propose the \texttt{FedAGE} framework, which enables the server to aggregate gradients in an encoded domain without accessing raw gradients of any single client. Thus, \texttt{FedAGE} can prevent the curious server from gradient stealing while maintaining the same prediction performance without additional communication costs. Furthermore, we theoretically prove that the proposed encoding-decoding framework is a Gaussian mechanism for differential privacy. Finally, we evaluate \texttt{FedAGE} under several federated settings, and the results have demonstrated the efficacy of the proposed framework.
    Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information. (arXiv:2205.13300v1 [cs.CL])
    Non-negative matrix factorization (NMF) based topic modeling is widely used in natural language processing (NLP) to uncover hidden topics in short text documents. Usually, training a high-quality topic model requires a large amount of textual data. In many real-world scenarios, customer textual data is private and sensitive, precluding uploading to data centers. This paper proposes a Federated NMF (FedNMF) framework, which allows multiple clients to collaboratively train a high-quality NMF based topic model with locally stored data. However, standard federated learning significantly undermines the performance of topic models in downstream tasks (e.g., text classification) when the data distribution over clients is heterogeneous. To alleviate this issue, we further propose FedNMF+MI, which simultaneously maximizes the mutual information (MI) between the count features of local texts and their topic weight vectors to mitigate the performance degradation. Experimental results show that our FedNMF+MI methods outperform Federated Latent Dirichlet Allocation (FedLDA) and FedNMF without MI for short texts by a significant margin on both coherence score and classification F1 score.
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v1 [cs.LG])
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over $T$ rounds ($NT$ actions in total) the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is at most $\tilde{\mathcal{O}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.
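    For orientation, the single-agent LinUCB core that DisBE-LUCB batches and distributes looks roughly as follows; this is the textbook algorithm, not the paper's batch-elimination or communication protocol.
    ```python
    import numpy as np

    class LinUCB:
        """Contextual linear bandit with optimism: pick the arm maximizing
        the regularized least-squares estimate plus an exploration bonus."""
        def __init__(self, d, alpha=1.0, lam=1.0):
            self.A = lam * np.eye(d)  # regularized Gram matrix
            self.b = np.zeros(d)
            self.alpha = alpha

        def choose(self, contexts):  # contexts: (n_arms, d)
            theta = np.linalg.solve(self.A, self.b)
            A_inv = np.linalg.inv(self.A)
            bonus = np.sqrt(np.einsum('ad,de,ae->a', contexts, A_inv, contexts))
            return int(np.argmax(contexts @ theta + self.alpha * bonus))

        def update(self, x, reward):
            self.A += np.outer(x, x)
            self.b += reward * x
    ```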
    Entropy Maximization with Depth: A Variational Principle for Random Neural Networks. (arXiv:2205.13076v1 [cs.LG])
    To understand the essential role of depth in neural networks, we investigate a variational principle for depth: Does increasing depth perform an implicit optimization for the representations in neural networks? We prove that random neural networks equipped with batch normalization maximize the differential entropy of representations with depth up to constant factors, assuming that the representations are contractive. Thus, representations inherently obey the \textit{principle of maximum entropy} at initialization, in the absence of information about the learning task. Our variational formulation for neural representations characterizes the interplay between representation entropy and architectural components, including depth, width, and non-linear activations, thereby potentially inspiring the design of neural architectures.
    Fast Vision Transformers with HiLo Attention. (arXiv:2205.13213v1 [cs.CV])
    Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which, however, has a clear gap with direct metrics such as throughput. Thus, we propose to use direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details while low frequencies capture global structures, whereas a multi-head self-attention layer neglects the characteristics of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and the other group performs attention to model the global relationship between the average-pooled low-frequency keys from each window and each query position in the input feature map. Benefiting from the efficient design of both groups, we show that HiLo is superior to existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/zip-group/LITv2.
    More Recent Advances in (Hyper)Graph Partitioning. (arXiv:2205.13202v1 [cs.DS])
    In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.
    Unsupervised Learning From Incomplete Measurements for Inverse Problems. (arXiv:2201.12151v3 [stat.ML] UPDATED)
    In many real-world inverse problems, only incomplete measurement data are available for training which can pose a problem for learning a reconstruction function. Indeed, unsupervised learning using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the underlying signal model needed for reconstruction which indicate the interplay between the number of distinct measurement operators, the number of measurements per operator, the dimension of the model and the dimension of the signals. Furthermore, we propose a novel and conceptually simple unsupervised learning loss which only requires access to incomplete measurement data and achieves a performance on par with supervised learning when the sufficient condition is verified. We validate our theoretical bounds and demonstrate the advantages of the proposed unsupervised loss compared to previous methods via a series of experiments on various imaging inverse problems, such as accelerated magnetic resonance imaging, compressed sensing and image inpainting.
    Forecasting Patient Demand at Urgent Care Clinics using Machine Learning. (arXiv:2205.13067v1 [cs.LG])
    Urgent care clinics and emergency departments around the world periodically suffer from extended wait times beyond patient expectations due to inadequate staffing levels. These delays have been linked with adverse clinical outcomes. Previous research into forecasting demand in this domain has mostly used a collection of statistical techniques, with machine learning approaches only now beginning to emerge in recent literature. The forecasting problem for this domain is difficult and has also been complicated by the COVID-19 pandemic, which has introduced additional complexity as typical demand patterns have been disrupted. This study explores the ability of machine learning methods to generate accurate forecasts of patient presentations at two large urgent care clinics located in Auckland, New Zealand. A number of machine learning algorithms were explored in order to determine the most effective technique for this problem domain, with the task of making forecasts of daily patient demand three months in advance. The study also performed an in-depth analysis of model behaviour, exploring which features are most effective at predicting demand and which features are capable of adapting to the volatility caused by the COVID-19 pandemic lockdowns. The results showed that ensemble-based methods delivered the most accurate and consistent solutions on average, generating improvements in the range of 23%-27% over the existing in-house methods for estimating daily demand.
    Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes. (arXiv:2205.13068v1 [cs.LG])
    We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound on the worst-case error of the best map from attributes to classes for this setting, even with perfect attribute detectors. The lower bound characterizes the theoretical intrinsic difficulty of the zero-shot problem based on the available information -- the class-attribute matrix -- and the bound is practically computable from it. Our lower bound is tight, as we show that we can always find a randomized map from attributes to classes whose expected error is upper bounded by the value of the lower bound. We show that our analysis can be predictive of how standard zero-shot methods behave in practice, including which classes will likely be confused with others.
    Grammar Detection for Sentiment Analysis through Improved Viterbi Algorithm. (arXiv:2205.13148v1 [cs.CL])
    Grammar Detection, also referred to as Parts of Speech (POS) Tagging of raw text, is considered an underlying building block of various Natural Language Processing pipelines such as named entity recognition, question answering, and sentiment analysis. In short, for a given sentence, Parts of Speech tagging is the task of labeling each word as a noun, verb, adjective, adverb, and so on. Sentiment Analysis is a procedure used to determine whether a given sentence's emotional tone is neutral, positive, or negative. To assign polarity scores to topics or entities within a phrase, approaches from in-text analysis and analytics, machine learning, and natural language processing are incorporated. Sentiment Analysis built on a POS tagger helps us obtain a summary of broader public opinion on a specific topic. For this, we use the Viterbi algorithm, a Hidden Markov Model, and a constraint-based Viterbi algorithm for POS tagging. By comparing their accuracies, we select the most accurate model for Sentiment Analysis to determine the sentiment of the sentence.
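    To make the decoding step concrete, here is a minimal sketch of first-order HMM Viterbi decoding of the kind this abstract relies on for POS tagging. The two-tag set and all transition and emission probabilities below are illustrative toy values, not the paper's trained model.

        import numpy as np

        # Toy HMM for POS tagging: states are tags, observations are words.
        # All probabilities are made-up illustrative values.
        tags = ["NOUN", "VERB"]
        start = np.log([0.6, 0.4])
        trans = np.log([[0.7, 0.3],    # P(next tag | NOUN)
                        [0.4, 0.6]])   # P(next tag | VERB)
        emit = {"dogs": np.log([0.8, 0.2]), "bark": np.log([0.3, 0.7])}

        def viterbi(words):
            """Return the most likely tag sequence under the toy HMM."""
            n, k = len(words), len(tags)
            score = np.full((n, k), -np.inf)
            back = np.zeros((n, k), dtype=int)
            score[0] = start + emit[words[0]]
            for t in range(1, n):
                for j in range(k):
                    cand = score[t - 1] + trans[:, j] + emit[words[t]][j]
                    back[t, j] = int(np.argmax(cand))
                    score[t, j] = cand[back[t, j]]
            path = [int(np.argmax(score[-1]))]
            for t in range(n - 1, 0, -1):
                path.append(back[t, path[-1]])
            return [tags[i] for i in reversed(path)]

        print(viterbi(["dogs", "bark"]))  # e.g. ['NOUN', 'VERB']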
    Learning to segment with limited annotations: Self-supervised pretraining with regression and contrastive loss in MRI. (arXiv:2205.13109v1 [cs.CV])
    Obtaining manual annotations for large datasets for supervised training of deep learning (DL) models is challenging. The availability of large unlabeled datasets compared to labeled ones motivates the use of self-supervised pretraining to initialize DL models for subsequent segmentation tasks. In this work, we consider two pre-training approaches for driving a DL model to learn different representations using: a) regression loss that exploits spatial dependencies within an image and b) contrastive loss that exploits semantic similarity between pairs of images. The effect of pretraining techniques is evaluated in two downstream segmentation applications using Magnetic Resonance (MR) images: a) liver segmentation in abdominal T2-weighted MR images and b) prostate segmentation in T2-weighted MR images of the prostate. We observed that DL models pretrained using self-supervision can be finetuned for comparable performance with fewer labeled datasets. Additionally, we observed that initializing the DL model using contrastive-loss-based pretraining performed better than using the regression loss.
    Independent Asymmetric Embedding for Information Diffusion Prediction on Social Networks. (arXiv:2105.08291v6 [cs.LG] UPDATED)
    The prediction of information diffusion on social networks has great practical significance in marketing and public opinion control. It aims to predict the individuals who will potentially repost a message on the social network. One type of method is based on demographics, complex networks, and other prior knowledge to establish an interpretable model that simulates and predicts the propagation process, while the other type is completely data-driven and maps the nodes to a latent space for propagation prediction. Existing latent space designs and embedding methods lack consideration of the mutual influence among users. In this paper, we propose an independent asymmetric embedding method that embeds each individual into one latent influence space and multiple latent susceptibility spaces. Based on the similarity between information diffusion and the heat diffusion phenomenon, the heat diffusion kernel is exploited in our model to establish the embedding rules. Furthermore, our method captures the co-occurrence regularity of user combinations in cascades to improve computational effectiveness. The results of extensive experiments conducted on real-world datasets verify both the predictive accuracy and cost-effectiveness of our approach.
    Deep-XFCT: Deep learning 3D-mineral liberation analysis with micro X-ray fluorescence and computed tomography. (arXiv:2205.13102v1 [cs.LG])
    The rapid development of X-ray micro-computed tomography (micro-CT) opens new opportunities for 3D analysis of particle and grain-size characterisation, determination of particle densities and shape factors, estimation of mineral associations, and liberation and locking. Current practices in mineral liberation analysis are based on 2D representations, leading to systematic errors in the extrapolation to volumetric properties. New quantitative methods based on tomographic data are therefore urgently required for characterisation of mineral deposits, mineral processing, characterisation of tailings, rock typing, stratigraphic refinement, and reservoir characterisation, with applications in the resource industry and the environmental and material sciences. To date, no simple non-destructive method exists for 3D mineral liberation analysis. We present a new development based on combining micro-CT with micro-X-ray fluorescence (micro-XRF) using deep learning. We demonstrate successful semi-automated multi-modal analysis of a crystalline magmatic rock, where the new technique overcomes the difficult task of differentiating feldspar from quartz in the micro-CT data. The approach is universal and can be extended to any multi-modal and multi-instrument analysis for further refinement. We conclude that the combination of micro-CT and micro-XRF already provides a new opportunity for robust 3D mineral liberation analysis in both field and laboratory applications.
    Contextual Pandora's Box. (arXiv:2205.13114v1 [cs.LG])
    Pandora's Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate priors are given for the values of all the alternatives, while recent work studies the online variant of Pandora's Box where priors are originally unknown. In this work, we extend Pandora's Box to the online setting, while incorporating context. At every round, we are presented with a number of alternatives, each having a context, an exploration cost, and an unknown value drawn from an unknown prior distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well to the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to the reservation value of the corresponding distribution rather than its mean.
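    For background, the reservation value the abstract refers to is the quantity at the heart of the classical offline policy with known priors (Weitzman's rule). The sketch below illustrates that baseline under an assumed Gaussian prior per box; it is not the paper's online, contextual algorithm.

        import numpy as np
        from scipy.optimize import brentq
        from scipy.stats import norm

        def reservation_value(mu, sigma, cost):
            """Solve E[(V - r)^+] = cost for a Gaussian prior N(mu, sigma^2):
            the classic reservation value of a box."""
            excess = lambda r: sigma * norm.pdf((r - mu) / sigma) \
                + (mu - r) * norm.sf((r - mu) / sigma) - cost
            return brentq(excess, mu - 10 * sigma, mu + 10 * sigma)

        def pandora(mus, sigmas, costs, rng):
            """Weitzman's policy with known priors: open boxes in decreasing
            reservation value, stop when the best value seen beats every
            unopened box's reservation value."""
            r = [reservation_value(m, s, c) for m, s, c in zip(mus, sigmas, costs)]
            order = np.argsort(r)[::-1]
            best, paid = -np.inf, 0.0
            for i in order:
                if best >= r[i]:
                    break
                paid += costs[i]
                best = max(best, rng.normal(mus[i], sigmas[i]))
            return best - paid  # realised net utility

        rng = np.random.default_rng(6)
        print(pandora([1.0, 0.5, 0.0], [1.0, 2.0, 0.5], [0.1, 0.2, 0.05], rng))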
    Unsupervised Reinforcement Adaptation for Class-Imbalanced Text Classification. (arXiv:2205.13139v1 [cs.CL])
    Class imbalance naturally exists when training and testing models in different domains. Unsupervised domain adaptation (UDA) augments model performance with only accessible annotations from the source domain and unlabeled data from the target domain. However, existing state-of-the-art UDA models learn domain-invariant representations and evaluate primarily on class-balanced data across domains. In this work, we propose an unsupervised domain adaptation approach via reinforcement learning that jointly leverages feature variants and imbalanced labels across domains. We experiment with the text classification task for its easily accessible datasets and compare the proposed method with five baselines. Experiments on three datasets prove that our proposed method can effectively learn robust domain-invariant representations and successfully adapt text classifiers on imbalanced classes over domains. The code is available at https://github.com/woqingdoua/ImbalanceClass.
    Leveraging Dependency Grammar for Fine-Grained Offensive Language Detection using Graph Convolutional Networks. (arXiv:2205.13164v1 [cs.CL])
    The last few years have witnessed an exponential rise in the propagation of offensive text on social media. Identification of this text with high precision is crucial for the well-being of society. Most of the existing approaches tend to give high toxicity scores to innocuous statements (e.g., "I am a gay man"). These false positives result from over-generalization on the training data where specific terms in the statement may have been used in a pejorative sense (e.g., "gay"). Emphasis on such words alone can lead to discrimination against the classes these systems are designed to protect. In this paper, we address the problem of offensive language detection on Twitter, while also detecting the type and the target of the offence. We propose a novel approach called SyLSTM, which integrates syntactic features in the form of the dependency parse tree of a sentence and semantic features in the form of word embeddings into a deep learning architecture using a Graph Convolutional Network. Results show that the proposed approach significantly outperforms the state-of-the-art BERT model with orders of magnitude fewer parameters.
    Towards Green AI with tensor networks -- Sustainability and innovation enabled by efficient algorithms. (arXiv:2205.12961v1 [cs.LG])
    The current standard to compare the performance of AI algorithms is mainly based on one criterion: the model's accuracy. In this context, algorithms with a higher accuracy (or similar measures) are considered as better. To achieve new state-of-the-art results, algorithmic development is accompanied by an exponentially increasing amount of compute. While this has enabled AI research to achieve remarkable results, AI progress comes at a cost: it is unsustainable. In this paper, we present a promising tool for sustainable and thus Green AI: tensor networks (TNs). Being an established tool from multilinear algebra, TNs have the capability to improve efficiency without compromising accuracy. Since they can reduce compute significantly, we would like to highlight their potential for Green AI. We elaborate in both a kernel machine and deep learning setting how efficiency gains can be achieved with TNs. Furthermore, we argue that better algorithms should be evaluated in terms of both accuracy and efficiency. To that end, we discuss different efficiency criteria and analyze efficiency in an exemplifying experimental setting for kernel ridge regression. With this paper, we want to raise awareness about Green AI and showcase its positive impact on sustainability and AI research. Our key contribution is to demonstrate that TNs enable efficient algorithms and therefore contribute towards Green AI. In this sense, TNs pave the way for better algorithms in AI.
    Formalizing Preferences Over Runtime Distributions. (arXiv:2205.13028v1 [cs.AI])
    When trying to solve a computational problem we are often faced with a choice among algorithms that are all guaranteed to return the right answer but that differ in their runtime distributions (e.g., SAT solvers, sorting algorithms). This paper aims to lay theoretical foundations for such choices by formalizing preferences over runtime distributions. It might seem that we should simply prefer the algorithm that minimizes expected runtime. However, such preferences would be driven by exactly how slow our algorithm is on bad inputs, whereas in practice we are typically willing to cut off occasional, sufficiently long runs before they finish. We propose a principled alternative, taking a utility-theoretic approach to characterize the scoring functions that describe preferences over algorithms. These functions depend on the way our value for solving our problem decreases with time and on the distribution from which captimes are drawn. We describe examples of realistic utility functions and show how to leverage a maximum-entropy approach for modeling underspecified captime distributions. Finally, we show how to efficiently estimate an algorithm's expected utility from runtime samples.
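    As a small illustration of the setup, the sketch below estimates an algorithm's expected utility from runtime samples given a utility function that decays with time and a captime. The exponential utility and the treatment of capped runs are simplifying assumptions for illustration, not the paper's exact formalism.

        import numpy as np

        def expected_utility(runtimes, utility, captime):
            """Monte Carlo estimate of E[utility] from runtime samples,
            cutting off runs longer than `captime`. Capped runs receive
            utility(captime) here, a simplifying assumption rather than
            the paper's exact treatment of captime distributions."""
            t = np.minimum(np.asarray(runtimes, dtype=float), captime)
            return float(np.mean(utility(t)))

        # Example: the value of solving decays exponentially with runtime.
        rng = np.random.default_rng(0)
        samples = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)
        u = lambda t: np.exp(-0.1 * t)
        print(expected_utility(samples, u, captime=30.0))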
    Uniform Generalization Bound on Time and Inverse Temperature for Gradient Descent Algorithm and its Application to Analysis of Simulated Annealing. (arXiv:2205.12959v1 [cs.LG])
    In this paper, we propose a novel uniform generalization bound on the time and inverse temperature for stochastic gradient Langevin dynamics (SGLD) in a non-convex setting. While previous works derive their generalization bounds by uniform stability, we use Rademacher complexity to make our generalization bound independent of the time and inverse temperature. Using Rademacher complexity, we can reduce the problem of deriving a generalization bound on the whole space to one on a bounded region, and can therefore remove the effect of the time and inverse temperature from our generalization bound. As an application of our generalization bound, an evaluation of the effectiveness of simulated annealing in a non-convex setting is also described. For the sample size $n$ and time $s$, we derive evaluations with orders $\sqrt{n^{-1} \log (n+1)}$ and $|(\log)^4(s)|^{-1}$, respectively. Here, $(\log)^4$ denotes the four-fold composition of the logarithmic function.
    Concurrent Neural Tree and Data Preprocessing AutoML for Image Classification. (arXiv:2205.13033v1 [cs.LG])
    Deep Neural Networks (DNNs) are a widely-used solution for a variety of machine learning problems. However, it is often necessary to invest a significant amount of a data scientist's time to pre-process input data, test different neural network architectures, and tune hyper-parameters for optimal performance. Automated machine learning (autoML) methods automatically search the architecture and hyper-parameter space for optimal neural networks. However, current state-of-the-art (SOTA) methods do not include traditional methods for manipulating input data as part of the algorithmic search space. We adapt the Evolutionary Multi-objective Algorithm Design Engine (EMADE), a multi-objective evolutionary search framework for traditional machine learning methods, to perform neural architecture search. We also integrate EMADE's signal processing and image processing primitives. These primitives allow EMADE to manipulate input data before ingestion into the simultaneously evolved DNN. We show that including these methods as part of the search space shows potential to provide benefits to performance on the CIFAR-10 image classification benchmark dataset.
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v1 [cs.LG])
    It is well-known that the worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of time steps (ignoring the dependency on sparsity). On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th time step. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime ($\sigma_t = \Omega(1)$) and the benign deterministic regimes ($\sigma_t = 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a ``black-box'' manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.
    Learning the spatio-temporal relationship between wind and significant wave height using deep learning. (arXiv:2205.13325v1 [stat.ML])
    Ocean wave climate has a significant impact on near-shore and off-shore human activities, and its characterisation can help in the design of ocean structures such as wave energy converters and sea dikes. Therefore, engineers need long time series of ocean wave parameters. Numerical models are a valuable source of ocean wave data; however, they are computationally expensive. Consequently, statistical and data-driven approaches have gained increasing interest in recent decades. This work investigates the spatio-temporal relationship between North Atlantic wind and significant wave height (Hs) at an off-shore location in the Bay of Biscay, using a two-stage deep learning model. The first step uses convolutional neural networks (CNNs) to extract the spatial features that contribute to Hs. Then, long short-term memory (LSTM) is used to learn the long-term temporal dependencies between wind and waves.
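    A minimal sketch of such a two-stage architecture follows: a CNN encodes each wind field into a feature vector and an LSTM models the temporal dependency before regressing Hs. All shapes, layer sizes, and the two-channel (u/v wind) input are illustrative assumptions, not the paper's exact configuration.

        import torch
        import torch.nn as nn

        class WindToHs(nn.Module):
            """CNN feature extractor per time step, then an LSTM over time."""
            def __init__(self, feat_dim=64, hidden=128):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
                )
                self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)  # predicted Hs

            def forward(self, x):          # x: (batch, time, 2, H, W) wind fields
                b, t = x.shape[:2]
                feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
                out, _ = self.lstm(feats)
                return self.head(out[:, -1])  # Hs at the last time step

        model = WindToHs()
        print(model(torch.randn(4, 12, 2, 32, 32)).shape)  # torch.Size([4, 1])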
    Towards Symbolic Time Series Representation Improved by Kernel Density Estimators. (arXiv:2205.12960v1 [cs.LG])
    This paper deals with symbolic time series representation. It builds on the popular mapping technique Symbolic Aggregate approXimation (SAX), which is extensively utilized in sequence classification, pattern mining, anomaly detection, time series indexing, and other data mining tasks. However, the disadvantage of this method is that it works reliably only for time series with a Gaussian-like distribution. In our previous work we proposed an improvement of SAX, called dwSAX, which can deal with Gaussian as well as non-Gaussian data distributions. Recently we have made further progress with our solution - edwSAX. Our goal was to optimally cover the information space by means of sufficient alphabet utilization, and to satisfy the lower bounding criterion as tightly as possible. We describe our approach here, including an evaluation on commonly employed tasks such as time series reconstruction error and Euclidean distance lower bounding, with promising improvements over SAX.
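    For reference, here is a sketch of the plain SAX baseline the paper improves on: z-normalise, piecewise aggregate, then map segment means to symbols via Gaussian breakpoints. Per the title, dwSAX/edwSAX would replace these fixed Gaussian breakpoints with density-estimated ones; that step is not shown.

        import numpy as np
        from scipy.stats import norm

        def sax(series, n_segments, alphabet_size):
            """Plain SAX: z-normalise, piecewise aggregate approximation
            (PAA), then discretise segment means with equiprobable
            Gaussian breakpoints."""
            x = (series - series.mean()) / series.std()
            paa = x.reshape(n_segments, -1).mean(axis=1)  # assumes divisibility
            breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
            symbols = np.searchsorted(breakpoints, paa)
            return "".join(chr(ord("a") + s) for s in symbols)

        rng = np.random.default_rng(1)
        print(sax(rng.normal(size=128), n_segments=8, alphabet_size=4))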
    QADAM: Quantization-Aware DNN Accelerator Modeling for Pareto-Optimality. (arXiv:2205.13045v1 [cs.AR])
    As the machine learning and systems communities strive to achieve higher energy-efficiency through custom deep neural network (DNN) accelerators, varied bit precision or quantization levels, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements (PE) into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QADAM, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration and Pareto-efficiency of DNN accelerators for various design choices such as bit precision, PE type, scratchpad sizes of PEs, global buffer size, number of total PEs, and DNN configurations. Our results show that different bit precisions and PE types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy varies more than 5x and 35x, respectively. We also show that the proposed lightweight processing elements (LightPEs) consistently achieve Pareto-optimal results in terms of accuracy and hardware-efficiency. With the proposed framework, we show that LightPEs achieve on par accuracy results and up to 5.7x more performance per area and energy improvement when compared to the best INT16 based design.
    RACE: A Reinforcement Learning Framework for Improved Adaptive Control of NoC Channel Buffers. (arXiv:2205.13130v1 [cs.AR])
    Network-on-chip (NoC) architectures rely on buffers to store flits to cope with contention for router resources during packet switching. Recently, reversible multi-function channel (RMC) buffers have been proposed to simultaneously reduce power and enable adaptive NoC buffering between adjacent routers. While adaptive buffering can improve NoC performance by maximizing buffer utilization, controlling the RMC buffer allocations requires a congestion-aware, scalable, and proactive policy. In this work, we present RACE, a novel reinforcement learning (RL) framework that utilizes better awareness of network congestion and a new reward metric ("falsefulls") to help guide the RL agent towards better RMC buffer control decisions. We show that RACE reduces NoC latency by up to 48.9%, and energy consumption by up to 47.1% against state-of-the-art NoC buffer control policies.
  • Open

    Ranking the information content of distance measures. (arXiv:2104.15079v2 [stat.ML] UPDATED)
    Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features but still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide ranging in many branches of science.
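    The sketch below shows a rank-based statistic in the spirit of this comparison: for each point, find its nearest neighbour under distance A and ask what rank that neighbour has under distance B. Values near 0 suggest A carries B's neighbourhood information; values near 1 suggest it does not. The exact statistic and normalisation used in the paper may differ.

        import numpy as np
        from scipy.spatial.distance import cdist

        def neighbour_rank_statistic(X_a, X_b):
            """Mean rank under B of each point's nearest neighbour under A,
            rescaled so that identical spaces give ~0 and independent
            spaces give ~1 (an illustrative normalisation)."""
            da, db = cdist(X_a, X_a), cdist(X_b, X_b)
            np.fill_diagonal(da, np.inf)
            np.fill_diagonal(db, np.inf)
            n = len(X_a)
            nn_a = da.argmin(axis=1)                      # NN under distance A
            ranks_b = db.argsort(axis=1).argsort(axis=1)  # ranks under B
            return 2.0 / n * ranks_b[np.arange(n), nn_a].mean()

        rng = np.random.default_rng(2)
        X = rng.normal(size=(200, 5))
        print(neighbour_rank_statistic(X, X))                         # ~0
        print(neighbour_rank_statistic(X, rng.normal(size=(200, 5)))) # ~1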
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v3 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; a statistical learning machine (SLM) is the algorithm, function, model, or rule, that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Cyberphysical systems such as Automatic target recognition (ATR) in military applications, computer aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, stock market prediction, etc., are a few examples and applications of ML: diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of the threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. We mean by \textit{construction} designing or inventing an appropriate algorithm that learns from the input data and achieves a good performance according to some optimality criterion. We mean by \textit{assessment} attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v2 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens. (arXiv:2107.07875v2 [stat.ML] UPDATED)
    A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
    Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations. (arXiv:2204.04773v2 [stat.ML] UPDATED)
    Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while the study of the richer model of imperfectly observed contextual bandits remains immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the above efficiency of Greedy policies is also provided.
    Improved Fine-Tuning by Better Leveraging Pre-Training Data. (arXiv:2111.12292v2 [cs.CV] UPDATED)
    As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that, in some vision tasks, training from scratch achieves final performance no worse than this pre-training strategy once the number of training samples is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis using the excess risk bound, which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. The observation inspires us to leverage pre-training data for fine-tuning, since this data is also available at fine-tuning time. The generalization result shows that the excess risk bound on a target task can be improved when appropriate pre-training data is included in fine-tuning. With this theoretical motivation, we propose a novel strategy to select a subset of the pre-training data to help improve generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data-selection-based fine-tuning pipeline.
    Amplitude Mean of Functional Data on $\mathbb{S}^2$. (arXiv:2107.13721v4 [stat.ML] UPDATED)
    Manifold-valued functional data analysis (FDA) has recently become an active area of research, motivated by the rising availability of trajectories or longitudinal data observed on non-linear manifolds. The challenges of analyzing such data come from many aspects, including infinite dimensionality and nonlinearity, as well as time-domain or phase variability. In this paper, we study the amplitude part of manifold-valued functions on $\mathbb{S}^2$, which is invariant to random time warping or re-parameterization. Utilizing the nice geometry of $\mathbb{S}^2$, we develop a set of efficient and accurate tools for temporal alignment of functions, geodesic computing, and sample mean calculation. At their heart, these tools rely on gradient descent algorithms with carefully derived gradients. We show the advantages of these newly developed tools over their competitors with extensive simulations and real data, and demonstrate the importance of considering the amplitude part of functions instead of mixing it with phase variability in manifold-valued FDA.
    Lorentzian Fully Hyperbolic Generative Adversarial Network. (arXiv:2201.12825v2 [cs.LG] UPDATED)
    With the recent advance of deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. While a variety of hyperbolic neural network structures have been proposed, they mainly focus on discriminative tasks, and generative models in the hyperbolic space have scarcely been studied. In this work, we propose a hyperbolic generative adversarial network (GAN) within the Lorentz model for generating hyperbolic data. In addition to existing hyperbolic operations, we design novel hyperbolic layers to guarantee stable training. We first use synthetic data to show that our network is able to learn a simple distribution in the hyperbolic space. Moreover, by virtue of an autoencoder, we construct a neural network model, named HAEGAN, for generating more complex data in the hyperbolic space. HAEGAN contains three parts: first, a hyperbolic autoencoder; second, a hyperbolic GAN for generating the latent embedding of the autoencoder; third, a generator that inherits the decoder from the autoencoder and the generator from the GAN. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v1 [q-bio.NC])
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v1 [stat.ML])
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between algorithms that require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v3 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.
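    To convey the flavour of bound-aware posterior sampling, the sketch below draws ordinary GP posterior samples and keeps only those respecting known bounds. This hard rejection is a simpler stand-in for the paper's transformed-GP, weighted-sample construction; the RBF kernel and all hyperparameters are illustrative.

        import numpy as np

        def rbf(a, b, ls=0.5):
            return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

        def bounded_posterior_samples(xtr, ytr, xte, lo, hi, n=2000, noise=1e-4):
            """Draw GP posterior function samples at xte and reject any that
            violate the known bounds [lo, hi] (a simplification of the
            paper's sample-weighting scheme)."""
            K = rbf(xtr, xtr) + noise * np.eye(len(xtr))
            Ks = rbf(xtr, xte)
            Kss = rbf(xte, xte) + 1e-8 * np.eye(len(xte))
            mu = Ks.T @ np.linalg.solve(K, ytr)
            cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
            f = np.random.default_rng(7).multivariate_normal(mu, cov, size=n)
            keep = ((f >= lo) & (f <= hi)).all(axis=1)
            return f[keep]

        xtr, ytr = np.array([0.0, 1.0]), np.array([0.2, 0.4])
        xte = np.linspace(0, 1, 20)
        samples = bounded_posterior_samples(xtr, ytr, xte, lo=0.0, hi=1.0)
        print(samples.shape)  # only bound-respecting posterior functions remain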
    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts. (arXiv:2111.03394v2 [cs.LG] UPDATED)
    Long range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts that are coherent in terms of base level and predicted aggregate statistics. We achieve the coherency between predicted base-level and aggregate statistics using a novel inference method based on KL-divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both base level and unseen aggregates post inference on real datasets spanning three diverse domains. (\href{https://github.com/pratham16cse/AggForecaster}{Project URL})
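    To show the flavour of closed-form KL coherency, here is a minimal sketch for the special case of independent Gaussian base forecasts under a single sum constraint: the KL projection spreads the discrepancy in proportion to each forecast's variance. This is an illustration of the idea, not the paper's full inference method.

        import numpy as np

        def reconcile_means(mu, var, mu_agg):
            """KL-project independent Gaussians N(mu_i, var_i) onto the
            constraint that their means sum to mu_agg: less confident
            (higher-variance) forecasts absorb more of the correction."""
            mu = np.asarray(mu, float)
            var = np.asarray(var, float)
            gap = mu_agg - mu.sum()
            return mu + var * gap / var.sum()

        base = reconcile_means(mu=[10.0, 20.0, 30.0], var=[1.0, 1.0, 4.0],
                               mu_agg=66.0)
        print(base, base.sum())  # means now sum to 66; widest forecast moved most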
    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks. (arXiv:2108.11489v2 [stat.ML] UPDATED)
    The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.
    Factor selection in screening experiments by aggregation over random models. (arXiv:2205.13497v1 [stat.ME])
    Screening experiments are useful for screening out a small number of truly important factors from a large number of potentially important factors. The Gauss-Dantzig Selector (GDS) is often the preferred analysis method for screening experiments. Just considering main-effects models can result in erroneous conclusions, but including interaction terms, even if restricted to two-factor interactions, increases the number of model terms dramatically and challenges the GDS analysis. We propose a new analysis method, called Gauss-Dantzig Selector Aggregation over Random Models (GDS-ARM), which performs a GDS analysis on multiple models that include only some randomly selected interactions. Results from these different analyses are then aggregated to identify the important factors. We discuss the proposed method, suggest choices for the tuning parameters, and study its performance on real and simulated data.
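    The aggregation idea can be sketched as follows: repeatedly fit a sparse regression on all main effects plus a random subset of two-factor interactions, then count how often each factor appears. The sketch uses Lasso as a stand-in for the Gauss-Dantzig Selector, and all tuning values are arbitrary illustrative choices.

        import numpy as np
        from itertools import combinations
        from sklearn.linear_model import Lasso

        def gds_arm_like(X, y, n_models=50, n_int=8, seed=0):
            """Fit sparse models on main effects plus random interaction
            subsets; return per-factor selection frequencies."""
            rng = np.random.default_rng(seed)
            n, p = X.shape
            pairs = list(combinations(range(p), 2))
            counts = np.zeros(p)
            for _ in range(n_models):
                chosen = [pairs[i] for i in
                          rng.choice(len(pairs), n_int, replace=False)]
                Z = np.hstack([X] + [X[:, [i]] * X[:, [j]] for i, j in chosen])
                coef = Lasso(alpha=0.1).fit(Z, y).coef_
                active = set(np.flatnonzero(np.abs(coef[:p]) > 1e-8))
                for k, (i, j) in enumerate(chosen):
                    if abs(coef[p + k]) > 1e-8:
                        active |= {i, j}   # interaction implicates both factors
                counts[sorted(active)] += 1
            return counts / n_models

        rng = np.random.default_rng(3)
        X = rng.choice([-1.0, 1.0], size=(24, 10))
        y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 0] * X[:, 2] \
            + rng.normal(0, 0.3, 24)
        print(gds_arm_like(X, y).round(2))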
    When Doubly Robust Methods Meet Machine Learning for Estimating Treatment Effects from Real-World Data: A Comparative Study. (arXiv:2204.10969v2 [stat.ME] UPDATED)
    Observational cohort studies are increasingly being used for comparative effectiveness research to assess the safety of therapeutics. Recently, various doubly robust methods have been proposed for average treatment effect estimation by combining the treatment model and the outcome model via different vehicles, such as matching, weighting, and regression. The key advantage of doubly robust estimators is that they require either the treatment model or the outcome model to be correctly specified to obtain a consistent estimator of average treatment effects, and therefore lead to a more accurate and often more precise inference. However, little work has been done to understand how doubly robust estimators differ due to their unique strategies of using the treatment and outcome models and how machine learning techniques can be combined with these estimators to boost their performance. Also, little is understood about how covariate selection, overlap of covariate distributions, and treatment effect heterogeneity affect the performance of these doubly robust estimators. Here we examine multiple popular doubly robust methods in the categories of matching, weighting, or regression, and compare their performance using different treatment and outcome models via extensive simulations and a real-world application. We found that incorporating machine learning with doubly robust estimators such as the targeted maximum likelihood estimator gives the best overall performance. Practical guidance on how to apply doubly robust estimators is provided.
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v1 [cs.LG])
    Selective classification is the task of rejecting inputs that a model would predict incorrectly, trading off input space coverage against model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks. For example, we improve coverage on CIFAR-10/SVHN by 10.1%/1.5% respectively at a fixed target error of 0.5%.
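    The core recipe can be sketched in a few lines: score each test point by how often late-stage checkpoints disagree with the final label, then reject the highest-scoring points. The `late_frac` knob and the quantile thresholding are illustrative simplifications of the paper's instantiation.

        import numpy as np

        def disagreement_scores(checkpoint_preds, late_frac=0.5):
            """checkpoint_preds: (n_checkpoints, n_examples) predicted labels.
            Score = fraction of late checkpoints disagreeing with the final
            prediction."""
            preds = np.asarray(checkpoint_preds)
            final = preds[-1]
            start = int(len(preds) * (1 - late_frac))
            return (preds[start:] != final).mean(axis=0)

        def selective_predict(checkpoint_preds, coverage=0.9):
            """Accept the `coverage` fraction of points with least disagreement."""
            scores = disagreement_scores(checkpoint_preds)
            return scores <= np.quantile(scores, coverage)  # boolean accept mask

        # Toy usage: 10 checkpoints, 5 test points, 3 classes.
        rng = np.random.default_rng(4)
        preds = rng.integers(0, 3, size=(10, 5))
        preds[:, 0] = 1  # a stable point, accepted first
        print(selective_predict(preds, coverage=0.6))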
    Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation. (arXiv:2103.11802v4 [cs.LG] UPDATED)
    In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
    Training ReLU networks to high uniform accuracy is intractable. (arXiv:2205.13531v1 [cs.LG])
    Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
    Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. (arXiv:2205.11508v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone (outperforming supervised methods in many modalities), the theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v3 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces \textit{The Neural Testbed}: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    Censored Quantile Regression Neural Networks. (arXiv:2205.13496v1 [stat.ML])
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
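    The simultaneous-grid idea can be sketched as a single network emitting several quantiles at once, trained with the pinball loss averaged over the grid. This sketch deliberately omits the censoring adjustment that is the paper's main contribution and imposes no monotonicity across quantiles; all sizes are illustrative.

        import torch
        import torch.nn as nn

        quantiles = torch.tensor([0.1, 0.5, 0.9])

        class QuantileNet(nn.Module):
            """One network emitting a grid of quantiles simultaneously."""
            def __init__(self, d_in, n_q=len(quantiles)):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                         nn.Linear(64, n_q))
            def forward(self, x):
                return self.net(x)

        def pinball_loss(pred, y, q=quantiles):
            """Average pinball (quantile) loss over the whole grid."""
            err = y.unsqueeze(1) - pred           # (batch, n_q)
            return torch.maximum(q * err, (q - 1) * err).mean()

        # Toy usage on uncensored data.
        x = torch.randn(256, 3)
        y = x @ torch.tensor([1.0, -2.0, 0.5]) + 0.3 * torch.randn(256)
        model = QuantileNet(3)
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(200):
            opt.zero_grad()
            loss = pinball_loss(model(x), y)
            loss.backward()
            opt.step()
        print(model(x[:1]))  # three quantile estimates (no crossing guarantee)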
    A framework for overparameterized learning. (arXiv:2205.13507v1 [cs.LG])
    An explanation for the success of deep neural networks is a central question in theoretical machine learning. According to classical statistical learning, the overparameterized nature of such models should imply a failure to generalize. Many argue that good empirical performance is due to the implicit regularization of first order optimization methods. In particular, the Polyak-{\L}ojasiewicz condition leads to gradient descent finding a global optimum that is close to initialization. In this work, we propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and infinite data. We then perform an analysis from the perspective of the Polyak-{\L}ojasiewicz condition. We obtain theoretical results of independent interest, concerning gradient descent on a composition $(f \circ F): G \to \mathbb{R}$ of functions $F: G \to H$ and $f: H \to \mathbb{R}$ with $G, H$ being Hilbert spaces. Building on these results, we determine the properties that have to be satisfied by the components of the prototype problem for gradient descent to find a global optimum that is close to initialization. We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem. Finally, we lay out a number of directions for future research.
    RKHS-SHAP: Shapley Values for Kernel Methods. (arXiv:2110.09167v2 [stat.ML] UPDATED)
    Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values~(SV), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose \textsc{RKHS-SHAP}, an attribution method for kernel machines that can efficiently compute both \emph{Interventional} and \emph{Observational Shapley values} using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for consistent model interpretation. Further, we propose \emph{Shapley regulariser}, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific feature's contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the SVs of sensitive features.
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v3 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are indeed able to satisfy the first two factors. Moreover, we empirically verify the third factor by conducting various experiments on real-world datasets, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v4 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap and cross validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of a ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.
    Epistemic Neural Networks. (arXiv:2107.08924v3 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of \textit{joint} predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the \textit{epistemic neural network} (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the \textit{epinet}: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension. (arXiv:2205.13303v1 [stat.ML])
    While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high-dimensions have the same minimum training loss as Gaussian data with corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.  ( 2 min )
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v1 [cs.LG])
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over $T$ rounds ($NT$ actions in total) the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is at most $\tilde{\mathcal{O}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.  ( 2 min )
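    For reference, a minimal single-agent LinUCB step of the kind DisBE-LUCB batches and distributes; the variable names and the exploration constant alpha are illustrative:

        import numpy as np

        def linucb_choose(A, b, contexts, alpha=1.0):
            # A: (d, d) Gram matrix, b: (d,) reward-weighted features, contexts: (K, d)
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucb = contexts @ theta + alpha * np.sqrt(
                np.einsum('ki,ij,kj->k', contexts, A_inv, contexts))
            return int(np.argmax(ucb))

        d, K, rng = 5, 10, np.random.default_rng(0)
        A, b, theta_star = np.eye(d), np.zeros(d), rng.normal(size=d)
        for t in range(100):
            X = rng.normal(size=(K, d))            # stochastic contexts
            k = linucb_choose(A, b, X)
            r = X[k] @ theta_star + 0.1 * rng.normal()
            A += np.outer(X[k], X[k]); b += r * X[k]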
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v1 [cs.LG])
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, and can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better-calibrated predictions. This work paves the way for future extensions in which a Transformer-based model is trained as a general HPO optimizer.  ( 2 min )
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v1 [cs.LG])
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, however, the theoretical properties of RPE-based Transformers remain largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers, asking whether the model is capable of approximating any continuous sequence-to-sequence function. One may naturally assume the answer is in the affirmative -- that RPE-based Transformers are universal function approximators. However, we present a negative result by showing that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate, no matter how deep and wide the network is. One key reason is that most RPEs are placed inside the softmax attention, which always produces a right stochastic matrix. This prevents the network from capturing positional information in the RPEs and limits its capacity. To overcome this problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With this theoretical guidance, we develop a novel attention module, Universal RPE-based (URPE) Attention, which satisfies the conditions; the corresponding URPE-based Transformers are therefore universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.  ( 2 min )
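    One way to read the proposed fix in code: keep the row-stochastic softmax output, but reweight it elementwise by a learnable matrix indexed by relative position, so attention rows need no longer sum to one. The PyTorch parameterization below is a sketch under that reading, not the paper's exact module:

        import torch
        import torch.nn as nn

        class URPEStyleAttention(nn.Module):
            def __init__(self, dim, max_len=512):
                super().__init__()
                self.qkv = nn.Linear(dim, 3 * dim)
                self.max_len = max_len
                # one learnable weight per relative offset in [-(max_len-1), max_len-1]
                self.rel = nn.Parameter(torch.ones(2 * max_len - 1))

            def forward(self, x):  # x: (B, L, D)
                B, L, D = x.shape
                q, k, v = self.qkv(x).chunk(3, dim=-1)
                attn = torch.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)
                idx = torch.arange(L, device=x.device)
                C = self.rel[idx[:, None] - idx[None, :] + self.max_len - 1]  # Toeplitz (L, L)
                return (attn * C) @ v  # rows are no longer forced to be stochastic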
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v1 [cs.LG])
    It is well-known that the worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of time steps (ignoring the dependency on sparsity). On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th time step. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime ($\sigma_t = \Omega(1)$) and the benign deterministic regimes ($\sigma_t = 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a ``black-box'' manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.  ( 2 min )
    Uniform Generalization Bound on Time and Inverse Temperature for Gradient Descent Algorithm and its Application to Analysis of Simulated Annealing. (arXiv:2205.12959v1 [cs.LG])
    In this paper, we propose a novel generalization bound for stochastic gradient Langevin dynamics (SGLD) in a non-convex setting that is uniform in the time and inverse temperature. While previous works derive their generalization bounds via uniform stability, we use Rademacher complexity to make our generalization bound independent of the time and inverse temperature. Using Rademacher complexity, we can reduce the problem of deriving a generalization bound on the whole space to deriving one on a bounded region, and can therefore remove the effect of the time and inverse temperature from our bound. As an application of our generalization bound, we also evaluate the effectiveness of simulated annealing in a non-convex setting. For sample size $n$ and time $s$, we derive evaluations with orders $\sqrt{n^{-1} \log (n+1)}$ and $|(\log)^4(s)|^{-1}$, respectively. Here, $(\log)^4$ denotes the four-fold composition of the logarithmic function.  ( 2 min )
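    For orientation, one SGLD step with inverse temperature beta; simulated annealing corresponds to raising beta (cooling the chain) over iterations. A minimal NumPy sketch:

        import numpy as np

        rng = np.random.default_rng(0)

        def sgld_step(theta, grad, eta=1e-3, beta=1e4):
            # gradient step plus Gaussian noise with scale sqrt(2 * eta / beta);
            # increasing beta over time recovers a simulated-annealing schedule
            return theta - eta * grad + rng.normal(size=theta.shape) * np.sqrt(2 * eta / beta)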
    Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback. (arXiv:2205.13451v1 [cs.LG])
    We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement, since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.  ( 2 min )
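    The FTPL principle is easiest to see in the simple experts setting: perturb the cumulative losses with fresh noise each round and play the argmin; in the AMDP setting that argmin becomes a call to the offline planning oracle. A minimal sketch:

        import numpy as np

        def ftpl_experts(loss_matrix, eta=1.0, seed=0):
            # loss_matrix: (T, K) losses for K experts over T rounds
            rng = np.random.default_rng(seed)
            T, K = loss_matrix.shape
            cum, total = np.zeros(K), 0.0
            for losses in loss_matrix:
                perturb = rng.exponential(scale=eta, size=K)  # fresh perturbation
                action = int(np.argmin(cum - perturb))        # perturbed leader
                total += losses[action]
                cum += losses
            return total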
    Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency. (arXiv:2205.13476v1 [cs.LG])
    Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.  ( 2 min )
    Fair Representation Learning through Implicit Path Alignment. (arXiv:2205.13316v1 [cs.LG])
    We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant across different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer loop and invariant optimal group predictors are updated in the inner loop. Moreover, the proposed bi-level objective is shown to fulfill the sufficiency rule, which is desirable in various practical scenarios but has not been commonly studied in fair learning. In addition, to avoid the high computational and memory cost of differentiating through the inner loop of the bi-level objective, we propose an implicit path alignment algorithm, which relies only on the solution of the inner optimization and implicit differentiation rather than the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measures.  ( 2 min )
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v1 [cs.LG])
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over conventional CO solvers, as DRL-NCO can learn CO solvers without supervised labels obtained from a verified solver. This paper presents a novel training scheme, Sym-NCO, that achieves significant performance gains for existing DRL-NCO methods. Sym-NCO is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Imposing symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO, as symmetricities are invariant features shared by certain CO tasks. Our experimental results verify that Sym-NCO greatly improves the performance of DRL-NCO methods on four CO tasks -- the traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP) -- without employing problem-specific techniques. Remarkably, on PCTSP, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, iterative local search (ILS), while running 240 times faster.  ( 2 min )
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v1 [cs.LG])
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to iterate over the input/output pairs of a training dataset. Interestingly, one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which generalizes active learning based on partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error to number of samples. We illustrate our technique in depth for robust regression.  ( 2 min )
    Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model. (arXiv:2205.13085v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.  ( 2 min )
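    A minimal sketch of extracting the exogenous error term under the stated model: fit the conditional mean $m(X)$, fit the conditional mean absolute deviation $\sigma(X)$ on the residuals, and standardize. GRCI itself is a customized algorithm; this only illustrates the model, with placeholder regressors and toy data:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 3))
        eps = rng.normal(size=1000)
        Y = np.sin(X[:, 0]) + eps * (0.5 + X[:, 1] ** 2)  # toy m(X) and sigma(X)

        m_hat = GradientBoostingRegressor().fit(X, Y)
        r = Y - m_hat.predict(X)                               # raw residual
        s_hat = GradientBoostingRegressor().fit(X, np.abs(r))  # estimates sigma(X)
        eps_hat = r / np.clip(s_hat.predict(X), 1e-6, None)    # recovered error term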
    Factorized Structured Regression for Large-Scale Varying Coefficient Models. (arXiv:2205.13080v1 [stat.ML])
    Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.  ( 2 min )
    On Learning Mixture of Linear Regressions in the Non-Realizable Setting. (arXiv:2205.13166v1 [stat.ML])
    While mixtures of linear regressions (MLR) are a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, {\em prediction} and {\em loss} are not well-defined in the context of mixtures. In this paper, we first show that MLR can be used for prediction where, instead of predicting a single label, the model predicts a list of values (also known as {\em list-decoding}). The list size equals the number of components in the mixture, and the loss is defined as the minimum among the losses incurred by the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves a small probability of prediction error. This calls for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works on MLR focus on the {\em realizable} setting, i.e., recovery of parameters when data are probabilistically generated by a mixed linear (noisy) model. In this paper we show that a version of the popular alternating minimization (AM) algorithm finds the best-fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in time polynomial in the number of datapoints and recovers a good approximation of the best-fit lines. The two algorithms are compared experimentally.  ( 2 min )
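    A minimal alternating-minimization sketch in the spirit described: assign each point to its best-fitting line (the list-decoding loss) and refit each line by least squares. Initialization and the paper's regularity conditions matter in practice; this is only the basic loop:

        import numpy as np

        def am_mlr(X, y, k=2, iters=20, seed=0):
            rng = np.random.default_rng(seed)
            n, d = X.shape
            W = rng.normal(size=(k, d))               # k candidate lines
            for _ in range(iters):
                losses = (X @ W.T - y[:, None]) ** 2  # (n, k) per-component losses
                z = losses.argmin(axis=1)             # min-loss assignment
                for j in range(k):
                    if (z == j).any():
                        W[j] = np.linalg.lstsq(X[z == j], y[z == j], rcond=None)[0]
            return W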
    Learning the spatio-temporal relationship between wind and significant wave height using deep learning. (arXiv:2205.13325v1 [stat.ML])
    Ocean wave climate has a significant impact on near-shore and off-shore human activities, and its characterisation can help in the design of ocean structures such as wave energy converters and sea dikes. Therefore, engineers need long time series of ocean wave parameters. Numerical models are a valuable source of ocean wave data; however, they are computationally expensive. Consequently, statistical and data-driven approaches have gained increasing interest in recent decades. This work investigates the spatio-temporal relationship between North Atlantic wind and significant wave height (Hs) at an off-shore location in the Bay of Biscay, using a two-stage deep learning model. The first step uses convolutional neural networks (CNNs) to extract the spatial features that contribute to Hs. Then, long short-term memory (LSTM) is used to learn the long-term temporal dependencies between wind and waves.  ( 2 min )
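    A minimal two-stage model of the kind described, in PyTorch; the input layout (a sequence of two-channel wind fields), channel counts, and the use of the final hidden state are assumptions of this sketch:

        import torch
        import torch.nn as nn

        class WindToHs(nn.Module):
            def __init__(self, feat=32):
                super().__init__()
                self.cnn = nn.Sequential(            # per-time-step spatial encoder
                    nn.Conv2d(2, 16, 3, stride=2), nn.ReLU(),
                    nn.Conv2d(16, feat, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
                self.lstm = nn.LSTM(feat, 64, batch_first=True)
                self.head = nn.Linear(64, 1)

            def forward(self, x):                    # x: (B, T, 2, H, W) wind fields
                B, T = x.shape[:2]
                f = self.cnn(x.flatten(0, 1)).view(B, T, -1)
                out, _ = self.lstm(f)                # long-term temporal dependencies
                return self.head(out[:, -1])         # Hs estimate at the last step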
    Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (arXiv:2205.13098v1 [cs.LG])
    The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.  ( 2 min )
    Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions. (arXiv:2205.13056v1 [stat.ML])
    Due to the drastic gap in complexity between sequential and batch statistical learning, recent work has studied a smoothed sequential learning setting, where Nature is constrained to select contexts with density bounded by $1/\sigma$ with respect to a known measure $\mu$. Unfortunately, for some function classes, there is an exponential gap between the statistically optimal regret and that which can be achieved efficiently. In this paper, we give a computationally efficient algorithm that is the first to enjoy the statistically optimal $\log(T/\sigma)$ regret for realizable $K$-wise linear classification. We extend our results to settings where the true classifier is linear in an over-parameterized polynomial featurization of the contexts, as well as to a realizable piecewise-regression setting assuming access to an appropriate ERM oracle. Somewhat surprisingly, standard disagreement-based analyses are insufficient to achieve regret logarithmic in $1/\sigma$. Instead, we develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers. Along the way, we develop numerous technical tools of independent interest, including a general anti-concentration bound for the determinant of certain matrix averages.  ( 2 min )
    Classification ensembles for multivariate functional data with application to mouse movements in web surveys. (arXiv:2205.13380v1 [stat.ME])
    We propose new ensemble models for multivariate functional data classification as combinations of semi-metric-based weak learners. Our models extend current semi-metric-type methods from the univariate to the multivariate case, propose new semi-metrics to compute distances between functions, and consider more flexible options for combining weak learners using stacked generalisation methods. We apply these ensemble models to identify respondents' difficulty with survey questions, with the aim to improve survey data quality. As predictors of difficulty, we use mouse movement trajectories from the respondents' interaction with a web survey, in which several questions were manipulated to create two scenarios with different levels of difficulty.  ( 2 min )
    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification. (arXiv:2205.13094v1 [cs.LG])
    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority-group data. To understand this phenomenon, we ask whether learning is fundamentally constrained by a lack of minority-group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely in real-world datasets), or unless the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift we show that an undersampling algorithm is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.  ( 2 min )
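    The undersampling baseline itself is a few lines of bookkeeping: subsample every group down to the size of the smallest group before training (group labels g assumed given). A minimal sketch:

        import numpy as np

        def undersample(X, y, g, seed=0):
            rng = np.random.default_rng(seed)
            groups, counts = np.unique(g, return_counts=True)
            n_min = counts.min()
            keep = np.concatenate([
                rng.choice(np.where(g == grp)[0], size=n_min, replace=False)
                for grp in groups])
            return X[keep], y[keep]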
    Preference Dynamics Under Personalized Recommendations. (arXiv:2205.13026v1 [cs.LG])
    Many projects (both practical and academic) have designed algorithms to match users to content they will enjoy, under the assumption that users' preferences and opinions do not change with the content they see. Evidence suggests that individuals' preferences are directly shaped by what content they see -- radicalization, rabbit holes, polarization, and boredom are all examples of preferences affected by content. Polarization in particular can occur even in ecosystems with "mass media," where no personalization takes place, as recently explored in a natural model of preference dynamics by \citet{hkazla2019geometric} and \citet{gaitonde2021polarization}. If all users' preferences are drawn towards content they already like, or repelled from content they already dislike, uniform consumption of media leads a population of heterogeneous preferences to converge towards only two poles. In this work, we explore whether some phenomenon akin to polarization occurs when users receive \emph{personalized} content recommendations. We use a similar model of preference dynamics, where an individual's preferences move towards content they consume and enjoy, and away from content they consume and dislike. We show that standard user reward maximization is an almost trivial goal in such an environment (a large class of simple algorithms will achieve only constant regret). A more interesting objective, then, is to understand under what conditions a recommendation algorithm can ensure stationarity of users' preferences. We show how to design content recommendations that achieve approximate stationarity, under mild conditions on the set of available content, when a user's preferences are known, and how one can learn enough about a user's preferences to implement such a strategy even when they are initially unknown.  ( 2 min )

  • Open

    [P] Looking for datasets containing slide decks / powerpoint presentations
    I'm currently working on a project that aims to generate summaries of slide decks / powerpoint presentations. I'm looking to augment the dataset that I have with some additional slides (hopefully but not necessarily paired with summaries), and wondered if anyone here has come across a dataset I could use. My initial thoughts were to use research paper abstracts paired with conference presentation slides. I was able to collect such a dataset from past ICML publications; however, many of the slides are overly technical and lack natural language (e.g. presentations consisting of nothing but equations). Additionally, these presentations are quite a bit out of distribution with respect to the domain I'd be applying it to (business). So, my question: does anyone know of any publicly available datasets that are composed of slides/presentations, or of any other resources like conferences (particularly in business oriented fields, e.g., finance, marketing, business administration, operations management) that have easily accessible pairs of slide presentations and summaries (either paper abstracts or other forms)? Thanks in advance! submitted by /u/mamossa [link] [comments]  ( 1 min )
    [D] Understanding the fundamental maths vs just use existing Python libraries?
    I'm interested in developing trading algorithms (not bots) using machine learning techniques. I'm a very experienced software developer, I've worked in trading for many years, and I am mathematical (couple of masters degrees in engineering -- did Fourier analysis, Laplace, autocorrelation, statistics etc) but haven't used it for years. Whilst looking for machine learning books there appear to be two approaches: Practical/use existing Python libraries (Hands-On Machine Learning with Scikit-Learn...) Theory, understand the fundamental statistics (The Elements of Statistical Learning) Do I need to/what is the advantage of understanding the fundamental statistics? Is it possible to simply use existing ML libraries, configure parameters and run simulations? If I need to teach myself it, I will. But if there's very small benefit it's better I invest my time elsewhere. Any additional advice would be greatly appreciated. I'm not sure what holding period I'll target yet (HFT or statistical arbitrage) but please feel free to distinguish if the answer differs. submitted by /u/reddit_faa7777 [link] [comments]  ( 2 min )
    [R] An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems - Google 2022 - Jeff Dean
    Paper: https://arxiv.org/abs/2205.12755 Abstract: "Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. Though, state of the art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, that adds the temporal aspect to multitask, is often focused to the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component to build the next generation artificial intelligence. We propose an evolutionary method that can generate a large scale multitask model, and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as cifar10: 99.43%." https://www.youtube.com/watch?v=Pcin4hPGaOk https://preview.redd.it/ni93muz8fv191.jpg?width=1108&format=pjpg&auto=webp&s=1782b2100bcedd03db4443fa6c58ef7c4c904488 https://preview.redd.it/8txpqsgafv191.jpg?width=987&format=pjpg&auto=webp&s=f723e57e78f81ce38c19917af2db3352f8d3a47c https://preview.redd.it/qb0xj9safv191.jpg?width=1128&format=pjpg&auto=webp&s=792056a1ae7b72a36d6e429d92e13f0f11249e8e submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [D] How can we approach this problem ?
    You have a combined dataset consisting of 10 component datasets collected from 10 different sources. Independent models trained separately on each component dataset perform well on hold-out examples from that dataset. However, the aggregated model trained by combining the examples from all component datasets behaves weirdly. On hold-out examples from some component datasets, the aggregated model performs better than the independent models. On others, it performs worse than the independent models. During deployment, you expect to see input examples from these 10 component sources but also from many other sources which the model has not been trained on. What approach will you take to develop a model that will generalize well to examples from the seen and also the yet-unseen sources? submitted by /u/corporatededmeat [link] [comments]  ( 1 min )
    [R] New datasets for StyleGAN
    Hi all, the author here. TL;DR: We show how StyleGAN can be adapted to raw unaligned images collected from the Internet. New datasets and models are available. How can we adapt StyleGAN to more complicated datasets? We have found that a data-centric approach is the most effective. Raw image collections downloaded from the internet contain many outlier images and are characterized by a multi-modal distribution. Therefore, we perform automatic self-supervised filtering of the training data to remove the outliers. Our key idea is to use the generator itself for the filtering. In the second step, we employ a multi-modal variant of the StyleGAN truncation trick. This allows high quality generation while preserving the remarkable editing capabilities of StyleGAN. For more details and cool gifs, check our Project Page: https://self-distilled-stylegan.github.io/ Datasets and models: https://github.com/self-distilled-stylegan/self-distilled-internet-photos The datasets can also be directly downloaded: https://github.com/rmokady/SDIP_utils Demo for image generation: https://huggingface.co/spaces/hysts/Self-Distilled-StyleGAN Feel free to ask anything that comes to your mind Generated Dog Generated Elephant submitted by /u/RonMokady [link] [comments]  ( 1 min )
    [R] CNNs are Myopic
    submitted by /u/downtownslim [link] [comments]  ( 3 min )
    [R] Large Language Models are Zero-Shot Reasoners. My summary: Adding text such as "Let’s think step by step" to a prompt "elicits chain of thought from large language models across a variety of reasoning tasks".
    submitted by /u/Wiskkey [link] [comments]  ( 4 min )
    [D] Semantic Segmentation/Remote Sensing Challenges
    Does anyone know of any interesting Semantic Segmentation and/or Remote Sensing competition taking place this summer? Most of what I found ends in the next 1-2 weeks. submitted by /u/incognitoacnt [link] [comments]
    [News] New Anomaly Advisor tab on Netdata
    Hello everyone, I work on the development of a host monitoring system called Netdata, a free and open source software. I thought it would be interesting to present here the new anomalies tab, which uses machine learning to show high anomaly rates across all of your nodes. We made a post on our blog about it in more detail. https://www.netdata.cloud/blog/introducing-anomaly-advisor-unsupervised-anomaly-detection-in-netdata/ submitted by /u/roeant [link] [comments]
  • Open

    All ML and AI E-books for $10 Sale!
    submitted by /u/alimhabidi [link] [comments]
    How do leading AIs do on human IQ tests in 2022?
    Could not find an answer to this question online. submitted by /u/ShallowStroker [link] [comments]  ( 1 min )
    In this article, we show how to label PDFs and scanned images using OCR in order to create your training dataset for your NLP application.
    submitted by /u/UBIAI [link] [comments]  ( 1 min )
    Corporate America bets on AI productivity
    submitted by /u/mr_j_b [link] [comments]
    Latest Artificial Intelligence Technologies — AI
    submitted by /u/mr_j_b [link] [comments]
    HPE is building a rapid AI supercomputer powered by the world’s largest CPU
    submitted by /u/mr_j_b [link] [comments]
    FUTURE TECH ASIA China and Europe are leading the push to regulate A.I. — one of them could set the global playbook
    submitted by /u/mr_j_b [link] [comments]  ( 1 min )
    Could AI Image-Generation technology like DALL-E and Imagen be the future of messaging?
    I feel like the application of technologies like DALL-E and Imagen would be perfect for everyday communication with friends and family, in the same way emojis and images and gifs enhance messaging today. Imagine an in-app messaging feature for iMessage or WhatsApp which allows you to type a prompt then scroll through various AI-generated images or gifs that fit the prompt to send to your friends and family rather than the emojis and gifs that most of us use now. Over recent years we have adapted to send emojis and images and gifs to express complex emotions and sentiments that would take lots of text to accomplish, but they’re still approximations because we are relying on an existing set of created images and videos. I imagine made-to-measure images and videos could be a type of casual communication in the future. submitted by /u/Psychadiculous [link] [comments]  ( 1 min )
    Microsoft AI Researchers Introduces (De)ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
    While there are several benefits to using Artificial Intelligence, there are also drawbacks to this cutting-edge technology. One example is the creation of inappropriate language by language models. Because these models are trained on massive amounts of data, inappropriate language may be learned due to its presence in the training data. Content moderation techniques can be used to flag or filter such language in specific cases; however, the datasets used to train these programs frequently fail to capture the complexity of potentially unsuitable and toxic language, particularly hate speech. Furthermore, the neutral samples in these datasets seldom contain group references. As a result, tools may flag even neutral language that refers to a minority identity group as hate speech. Inspired by large language models' capacity to emulate the tone, style, and vocabulary of the prompts they receive, whether toxic or benign, a dataset needs to be created for training content moderation algorithms to better detect implicitly harmful material. Continue Reading | Check out the paper, Microsoft blog and Github codes https://preview.redd.it/zkushg4jdu191.png?width=1302&format=png&auto=webp&s=29bbf108b18539083d478ea718078cf3502962b7 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI-art isn't art
    submitted by /u/estasfuera [link] [comments]  ( 14 min )
    New framework for firms to ensure AI services are reliable, safe
    submitted by /u/kuang89 [link] [comments]  ( 1 min )
    AI Dream 53 - Cosmic Birth | 300 SUBS CELEBRATION
    submitted by /u/LordPewPew777 [link] [comments]
  • Open

    Help needed
    Please, does anyone have experience using Gaussian processes to update Q-values instead of a deep learning network in an RL setting? I have separate code for both, but I need some help merging them for a project. I will be happy to pay for a consultation if needed, and feel free to give a referral if you know a friend or have access to some helpful resources. This is an off-policy RL setting with a static dataset. submitted by /u/Thin-Ad9581 [link] [comments]  ( 1 min )
    Classic control from pixels
    Hey all, I'm trying to play around with off-policy learning from pixels/images. To start easy, I thought it would be a good idea to take a classic control environment like Pendulum-v1 and try to solve it from pixels directly. To do this, I decided to wrap the gym env in a custom wrapper, which seems to be working for all intents and purposes:

        class ImgObservationWrapper(Wrapper):
            def __init__(self, env):
                super(ImgObservationWrapper, self).__init__(env)
                self.reset()
                dummy_obs = self.process_img(env.render(mode="rgb_array"))
                self.observation_space = spaces.Box(
                    low=0, high=255, shape=dummy_obs.shape, dtype=dummy_obs.dtype
                )

            def reset(self, **kwargs):
                obs = self.env.reset(**kwargs)
                obs = self.process_img(self.env.render(mode="rgb_array"))
                return obs

            def process_img(self, observation, crop…  ( 1 min )
    Help with MADDPG on Food Collector (Unity ML-Agents)
    Can somebody help me with MADDPG? I am trying to apply it to Food Collector (Unity ML-Agents), and my model constantly converges to action extremes (either -1 or 1) after only a couple of steps. There are 5 agents, 40x40x5 observations, 3 continuous actions, and 1 discrete action. Buffer size is 10k, and the agent takes random actions for the first 10k steps. https://preview.redd.it/9gon27amsu191.png?width=725&format=png&auto=webp&s=ebe9f3552b9e39b2f480cb01ad35cd6e3ae21092 These are the actions of agent #1, so you can see an example of the model quickly converging to -1 or 1 (in this case -1, -1, 1). submitted by /u/TheGuy839 [link] [comments]  ( 1 min )
  • Open

    Breakthrough Google AI Text to Image Imagen Beats Dalle-2 With Unprecedented Photorealism and Deep Level of Language Understanding
    submitted by /u/getrich_or_diemining [link] [comments]
  • Open

    How AI Impacts the Future of Human Capital Management in 2022  ( 5 min )
  • Open

    A Devotion to Emotion: Hume AI’s Alan Cowen on the Intersection of AI and Empathy
    Can machines experience emotions? They might, according to Hume AI, an AI research lab and technology company that aims to “ensure artificial intelligence is built to serve human goals and emotional well-being.” So how can AI genuinely understand how we are feeling, and respond appropriately? On this episode of NVIDIA’s AI Podcast, host Noah Kravitz Read article > The post A Devotion to Emotion: Hume AI’s Alan Cowen on the Intersection of AI and Empathy appeared first on NVIDIA Blog.  ( 2 min )
    Ready, Set, Game: GFN Thursday Brings 10 New Titles to GeForce NOW
    It’s a beautiful day to play video games. And it’s GFN Thursday, which means we’ve got those games. Ten total titles join the GeForce NOW library of over 1,300 games, starting with the release of Roller Champions – a speedy, free-to-play roller skating title launching with competitive season 0. Rollin’ Into the Weekend Roll with Read article > The post Ready, Set, Game: GFN Thursday Brings 10 New Titles to GeForce NOW appeared first on NVIDIA Blog.  ( 2 min )
    Deciphering the Future: HPE Switches on AI Supercomputer in France
    Recalling the French linguist who deciphered the Rosetta Stone 150 years ago, Hewlett Packard Enterprise today switched on a tool to unravel its customers’ knottiest problems. The Champollion AI supercomputer takes its name from Jean-François Champollion (1790-1832), who decoded hieroglyphics that opened a door to study of ancient Egypt’s culture. Like Champollion, the mega-system resides Read article > The post Deciphering the Future: HPE Switches on AI Supercomputer in France appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Interactive Packaging: How to Make Packaging Smarter with AI and IoT
    Today, packaging is not just about covering a product to sell it better. Interactive packaging is a new trend in the packaging industry that mainly focuses on customer satisfaction and engagement. BLE codes, AI, and IoT are key technologies in interactive packaging, helping users with product attributes and user instructions. Also, Incorporation… Read More »Interactive Packaging: How to Make Packaging Smarter with AI and IoT The post Interactive Packaging: How to Make Packaging Smarter with AI and IoT appeared first on Data Science Central.  ( 2 min )
    How to Choose the Best Salon CRM Software
    As you may know, everyone desires to look good, or better than everyone else. To look prettier, cooler, smarter, and bolder, people go to salons, because a salon is a place that takes care of your outer beauty and makes you look better than before. They use different products on your body and which… Read More »How to Choose the Best Salon CRM Software The post How to Choose the Best Salon CRM Software appeared first on Data Science Central.  ( 5 min )
  • Open

    Deep interpretable ensembles. (arXiv:2205.12729v1 [stat.ML])
    Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.  ( 2 min )
    MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning. (arXiv:2205.12449v1 [cs.LG])
    Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable RL has shown promise in extracting more interpretable decision tree-based policies, but only in the single-agent setting. To fill this gap, we propose the first set of interpretable MARL algorithms that extract decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER can learn high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.  ( 2 min )
    ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data. (arXiv:2205.12600v1 [cs.CL])
    Large pretrained language models have been performing increasingly well in a variety of downstream tasks via prompting. However, it remains unclear from where the model learns the task-specific knowledge, especially in a zero-shot setup. In this work, we want to find evidence of the model's task-specific competence from pretraining and are specifically interested in locating a very small subset of pretraining data that directly supports the model in the task. We call such a subset supporting data evidence and propose a novel method ORCA to effectively identify it, by iteratively using gradient information related to the downstream task. This supporting data evidence offers interesting insights about the prompted language models: in the tasks of sentiment analysis and textual entailment, BERT shows a substantial reliance on BookCorpus, the smaller corpus of BERT's two pretraining corpora, as well as on pretraining examples that mask out synonyms to the task verbalizers.  ( 2 min )
    Online Metro Origin-Destination Prediction via Heterogeneous Information Aggregation. (arXiv:2107.00946v5 [cs.LG] UPDATED)
    Metro origin-destination prediction is a crucial yet challenging time-series analysis task in intelligent transportation systems, which aims to accurately forecast two specific types of cross-station ridership, i.e., Origin-Destination (OD) and Destination-Origin (DO) ridership. However, complete OD matrices of previous time intervals cannot be obtained immediately in online metro systems, and conventional methods use only limited information to forecast the future OD and DO ridership separately. In this work, we propose a novel neural network module termed Heterogeneous Information Aggregation Machine (HIAM), which fully exploits heterogeneous information of historical data (e.g., incomplete OD matrices, unfinished order vectors, and DO matrices) to jointly learn the evolutionary patterns of OD and DO ridership. Specifically, an OD modeling branch estimates the potential destinations of unfinished orders explicitly to complement the information of incomplete OD matrices, while a DO modeling branch takes DO matrices as input to capture the spatial-temporal distribution of DO ridership. Moreover, a Dual Information Transformer is introduced to propagate the mutual information among OD features and DO features for modeling the OD-DO causality and correlation. Based on the proposed HIAM, we develop a unified Seq2Seq network to forecast the future OD and DO ridership simultaneously. Extensive experiments conducted on two large-scale benchmarks demonstrate the effectiveness of our method for online metro origin-destination prediction. Our code is released at https://github.com/HCPLab-SYSU/HIAM.  ( 2 min )
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v1 [stat.ML])
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is one of the most effective approaches to defend against such examples. We show that for linear regression problems, adversarial training can be formulated as a convex problem. This fact is then used to show that $\ell_\infty$-adversarial training produces sparse solutions and has many similarities to the lasso method. Similarly, $\ell_2$-adversarial training has similarities with ridge regression. We use a robust regression framework to analyze and understand these similarities and also point to some differences. Finally, we show how adversarial training behaves differently from other regularization methods when estimating overparameterized models (i.e., models with more parameters than datapoints). It minimizes a sum of three terms which regularizes the solution, but unlike lasso and ridge regression, it can sharply transition into an interpolation mode. We show that for sufficiently many features or sufficiently small regularization parameters, the learned model perfectly interpolates the training data while still exhibiting good out-of-sample performance.  ( 2 min )
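    The convexity is concrete for $\ell_\infty$ perturbations of squared-loss linear regression, where the inner maximization has the closed form $\max_{\|\delta\|_\infty \le \epsilon} (y - (x+\delta)^\top w)^2 = (|y - x^\top w| + \epsilon \|w\|_1)^2$, which makes the lasso-like $\ell_1$ penalty visible. A minimal sketch (the toy data and optimizer choice are assumptions):

        import numpy as np
        from scipy.optimize import minimize

        def adv_loss(w, X, y, eps):
            # closed-form inner maximization over ||delta||_inf <= eps
            return np.mean((np.abs(y - X @ w) + eps * np.abs(w).sum()) ** 2)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 10))
        w_true = np.zeros(10); w_true[:2] = [1.0, -2.0]   # sparse ground truth
        y = X @ w_true + 0.1 * rng.normal(size=100)
        w_hat = minimize(adv_loss, np.zeros(10), args=(X, y, 0.1), method="Powell").x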
    Differentially Private AUC Computation in Vertical Federated Learning. (arXiv:2205.12412v1 [cs.LG])
    Federated learning has gained great attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple parties. As a sub-category, vertical federated learning (vFL) focuses on the scenario where features and labels are split into different parties. The prior work on vFL has mostly studied how to protect label privacy during model training. However, model evaluation in vFL might also lead to potential leakage of private label information. One mitigation strategy is to apply label differential privacy (DP) but it gives bad estimations of the true (non-private) metrics. In this work, we propose two evaluation algorithms that can more accurately compute the widely used AUC (area under curve) metric when using label DP in vFL. Through extensive experiments, we show our algorithms can achieve more accurate AUCs compared to the baselines.  ( 2 min )
    Wavelet Feature Maps Compression for Image-to-Image CNNs. (arXiv:2205.12268v1 [cs.CV])
    Convolutional Neural Networks (CNNs) are known for requiring extensive computational resources, and quantization is among the best and most common methods for compressing them. While aggressive quantization (i.e., less than 4 bits) performs well for classification, it may cause severe performance degradation in image-to-image tasks such as semantic segmentation and depth estimation. In this paper, we propose Wavelet Compressed Convolution (WCC) -- a novel approach for high-resolution activation map compression integrated with point-wise convolutions, which are the main computational cost of modern architectures. To this end, we use an efficient and hardware-friendly Haar-wavelet transform, known for its effectiveness in image compression, and define the convolution on the compressed activation map. We experiment on various tasks that benefit from high-resolution input, and by combining WCC with light quantization, we achieve compression rates equivalent to 1-4 bit activation quantization with relatively small and much more graceful degradation in performance.  ( 2 min )
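    For reference, one level of the 2D Haar transform that WCC builds on, splitting a map into coarse averages (LL) and details; a minimal NumPy sketch (in WCC the point-wise convolution then runs on the compressed map):

        import numpy as np

        def haar2d(a):
            # a: (H, W) with even H and W -> four (H/2, W/2) sub-bands
            s = (a[0::2, :] + a[1::2, :]) / 2   # vertical averages
            d = (a[0::2, :] - a[1::2, :]) / 2   # vertical details
            LL = (s[:, 0::2] + s[:, 1::2]) / 2  # coarse approximation
            LH = (s[:, 0::2] - s[:, 1::2]) / 2
            HL = (d[:, 0::2] + d[:, 1::2]) / 2
            HH = (d[:, 0::2] - d[:, 1::2]) / 2
            return LL, LH, HL, HH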
    Giga-scale Kernel Matrix Vector Multiplication on GPU. (arXiv:2202.01085v2 [math.NA] UPDATED)
    Kernel matrix-vector multiplication (KMVM) is a foundational operation in machine learning and scientific computing. However, as KMVM tends to scale quadratically in both memory and time, applications are often limited by these computational constraints. In this paper, we propose a novel approximation procedure coined \textit{Faster-Fast and Free Memory Method} ($\text{F}^3$M) to address these scaling issues of KMVM for tall ($10^8\sim 10^9$) and skinny ($D\leq7$) data. Extensive experiments demonstrate that $\text{F}^3$M has empirical \emph{linear time and memory} complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points \emph{in under a minute} on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, \emph{improving speed 1.5-5.5 times} at the cost of $<1\%$ drop in accuracy. We further demonstrate competitive results on \emph{Gaussian Process regression} coupled with significant speedups on a variety of real-world datasets.  ( 2 min )
    Sketch guided and progressive growing GAN for realistic and editable ultrasound image synthesis. (arXiv:2204.06929v3 [eess.IV] UPDATED)
    Ultrasound (US) imaging is widely used for anatomical structure inspection in clinical diagnosis. The training of new sonographers and deep learning based algorithms for US image analysis usually requires a large amount of data. However, obtaining and labeling large-scale US imaging data are not easy tasks, especially for diseases with low incidence. Realistic US image synthesis can alleviate this problem to a great extent. In this paper, we propose a generative adversarial network (GAN) based image synthesis framework. Our main contributions include: 1) we present the first work that can synthesize realistic B-mode US images with high-resolution and customized texture editing features; 2) to enhance structural details of generated images, we propose to introduce auxiliary sketch guidance into a conditional GAN. We superpose the edge sketch onto the object mask and use the composite mask as the network input; 3) to generate high-resolution US images, we adopt a progressive training strategy to gradually generate high-resolution images from low-resolution images. In addition, a feature loss is proposed to minimize the difference of high-level features between the generated and real images, which further improves the quality of generated images; 4) the proposed US image synthesis method is quite universal and can also be generalized to the US images of other anatomical structures besides the three tested in our study (lung, hip joint, and ovary); 5) extensive experiments on three large US image datasets are conducted to validate our method. Ablation studies, customized texture editing, user studies, and segmentation tests demonstrate promising results of our method in synthesizing realistic US images.  ( 3 min )
    TrustGNN: Graph Neural Network based Trust Evaluation via Learnable Propagative and Composable Nature. (arXiv:2205.12784v1 [cs.LG])
    Trust evaluation is critical for many applications such as cyber security, social communication, and recommender systems. Users and the trust relationships among them can be seen as a graph. Graph neural networks (GNNs) have shown a powerful ability to analyze graph-structured data. Very recently, existing work attempted to introduce the attributes and asymmetry of edges into GNNs for trust evaluation, but failed to capture some essential properties (e.g., the propagative and composable nature) of trust graphs. In this work, we propose a new GNN-based trust evaluation method named TrustGNN, which smartly integrates the propagative and composable nature of trust graphs into a GNN framework for better trust evaluation. Specifically, TrustGNN designs specific propagative patterns for different propagative processes of trust, and distinguishes the contribution of different propagative processes to creating new trust. Thus, TrustGNN can learn comprehensive node embeddings and predict trust relationships based on these embeddings. Experiments on some widely-used real-world datasets indicate that TrustGNN significantly outperforms the state-of-the-art methods. We further perform analytical experiments to demonstrate the effectiveness of the key designs in TrustGNN.  ( 2 min )
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v3 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.  ( 2 min )
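    As a reading aid, here is a hypothetical sketch of selective imputation; in the paper, the split between columns to impute and columns to leave missing would come from the MCM analysis, whereas here the column lists, the mean-imputation choice, and all names are our own assumptions.

        # Sketch of selective imputation under assumed column roles.
        import pandas as pd

        def selective_impute(df, impute_cols, keep_missing_cols):
            out = df.copy()
            # Impute only covariates whose missingness is NOT informative
            # of treatment selection (hypothetical split).
            for c in impute_cols:
                out[c] = out[c].fillna(out[c].mean())
            # For treatment-determined missingness, keep an explicit
            # indicator rather than filling in a value.
            for c in keep_missing_cols:
                out[c + "_missing"] = out[c].isna().astype(int)
            return out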
    Cross Domain Few-Shot Learning via Meta Adversarial Training. (arXiv:2202.05713v3 [cs.LG] UPDATED)
    Few-shot relation classification (RC) is one of the critical problems in machine learning. Current research merely focuses on set-ups in which both training and testing are from the same domain. However, in practice, this assumption is not always guaranteed. In this study, we present a novel model that takes the aforementioned cross-domain situation into consideration. Unlike previous models, we use only the source domain data to train the prototypical networks and test the model on target domain data. A meta-based adversarial training framework (MBATF) is proposed to fine-tune the trained networks for adapting to data from the target domain. Empirical studies confirm the effectiveness of the proposed model.  ( 2 min )
    HEBO Pushing The Limits of Sample-Efficient Hyperparameter Optimisation. (arXiv:2012.03826v6 [cs.LG] UPDATED)
    In this work we rigorously analyse assumptions inherent to black-box optimisation hyper-parameter tuning tasks. Our results on the Bayesmark benchmark indicate that heteroscedasticity and non-stationarity pose significant challenges for black-box optimisers. Based on these findings, we propose a Heteroscedastic and Evolutionary Bayesian Optimisation solver (HEBO). HEBO performs non-linear input and output warping, admits exact marginal log-likelihood optimisation and is robust to the values of learned parameters. We demonstrate HEBO's empirical efficacy on the NeurIPS 2020 Black-Box Optimisation challenge, where HEBO placed first. Upon further analysis, we observe that HEBO significantly outperforms existing black-box optimisers on 108 machine learning hyperparameter tuning tasks comprising the Bayesmark benchmark. Our findings indicate that the majority of hyper-parameter tuning tasks exhibit heteroscedasticity and non-stationarity, multi-objective acquisition ensembles with Pareto front solutions improve queried configurations, and robust acquisition maximisers afford empirical advantages relative to their non-robust counterparts. We hope these findings may serve as guiding principles for practitioners of Bayesian optimisation. All code is made available at https://github.com/huawei-noah/HEBO.  ( 2 min )
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v1 [stat.ML])
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on public data, then using private data only for "transfer learning." In particular, we minimize the maximum mean discrepancy (MMD) between private target data and the generated distribution, using a kernel based on perceptual features from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images faithfully with $\varepsilon \approx 2$, far surpassing the current state of the art, which only models MNIST and FashionMNIST at $\varepsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.  ( 2 min )
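    The "privatize once" idea can be sketched as follows; the feature map, the clipping, the noise scale, and the function names are our assumptions, not the authors' implementation. The point is that the only data-dependent term in the MMD objective is the mean feature embedding of the private set, so a single Gaussian-mechanism release suffices.

        # Sketch: release the private mean embedding once, then train freely.
        import torch

        def private_mean_embedding(private_x, phi, noise_multiplier):
            feats = phi(private_x)                      # (n, d) features
            # Clip each feature vector to the unit ball so the sensitivity
            # bound below holds.
            feats = feats / feats.norm(dim=1, keepdim=True).clamp(min=1.0)
            mu = feats.mean(0)
            sensitivity = 2.0 / feats.shape[0]          # replace-one sensitivity
            return mu + noise_multiplier * sensitivity * torch.randn_like(mu)

        # Training then minimizes ||mu_private - phi(G(z)).mean(0)||^2 with no
        # further privacy cost, since mu_private was released once.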
    Deep Reinforcement Learning Guided Graph Neural Networks for Brain Network Analysis. (arXiv:2203.10093v2 [cs.LG] UPDATED)
    Modern neuroimaging techniques, such as diffusion tensor imaging (DTI) and functional magnetic resonance imaging (fMRI), enable us to model the human brain as a brain network or connectome. Capturing brain networks' structural information and hierarchical patterns is essential for understanding brain functions and disease states. Recently, the promising network representation learning capability of graph neural networks (GNNs) has prompted many GNN-based methods for brain network analysis to be proposed. Specifically, these methods apply feature aggregation and global pooling to convert brain network instances into meaningful low-dimensional representations used for downstream brain network analysis tasks. However, existing GNN-based methods often neglect that brain networks of different subjects may require various aggregation iterations and use GNN with a fixed number of layers to learn all brain networks. Therefore, how to fully release the potential of GNNs to promote brain network analysis is still non-trivial. To solve this problem, we propose a novel brain network representation framework, namely BN-GNN, which searches for the optimal GNN architecture for each brain network. Concretely, BN-GNN employs deep reinforcement learning (DRL) to train a meta-policy to automatically determine the optimal number of feature aggregations (reflected in the number of GNN layers) required for a given brain network. Extensive experiments on eight real-world brain network datasets demonstrate that our proposed BN-GNN improves the performance of traditional GNNs on different brain network analysis tasks.  ( 2 min )
    Robust Reinforcement Learning on Graphs for Logistics optimization. (arXiv:2205.12888v1 [cs.LG])
    Logistics optimization is nowadays becoming one of the hottest areas in the AI community. In the past year, significant advancements in the domain were achieved by representing the problem in the form of a graph. Another promising area of research is the application of reinforcement learning algorithms to this task. In our work, we take advantage of both approaches and apply reinforcement learning on a graph. To do so, we analyzed the most recent results in both fields and selected SOTA algorithms from both graph neural networks and reinforcement learning. Then, we combined the selected models on the problem of AMOD systems optimization for the transportation network of New York City. Our team compared three algorithms - GAT, Pro-CNN and PTDNet - to bring to the fore the important nodes in a graph representation. Finally, we achieved SOTA results on the AMOD systems optimization problem by employing PTDNet with a GNN and training them in a reinforcement learning fashion. Keywords: Graph Neural Network (GNN), Logistics optimization, Reinforcement Learning  ( 2 min )
    Residual-Concatenate Neural Network with Deep Regularization Layers for Binary Classification. (arXiv:2205.12775v1 [cs.LG])
    Many complex Deep Learning models are used, with different variations, for various prognostication tasks. A higher number of learning parameters does not necessarily ensure better accuracy. This can be addressed by pairing very deep models with a range of regularization-based techniques. In this paper we train a deep neural network that uses many regularization layers with residual and concatenation processes to best fit the prognostication of Polycystic Ovary Syndrome diagnosis. The network was built with improvements at every step where it failed to meet the needs of the data, and it achieves an accuracy of 99.3%.  ( 2 min )
    Gradient-based explanations for Gaussian Process regression and classification models. (arXiv:2205.12797v1 [cs.LG])
    Gaussian Processes (GPs) have proven themselves as a reliable and effective method in probabilistic Machine Learning. Thanks to recent and current advances, modeling complex data with GPs is becoming more and more feasible. Thus, these types of models are, nowadays, an interesting alternative to Neural and Deep Learning methods, which are arguably the current state-of-the-art in Machine Learning. For the latter, we see an increasing interest in so-called explainable approaches - in essence, methods that aim to make a Machine Learning model's decision process transparent to humans. Such methods are particularly needed when illogical or biased reasoning can lead to actual disadvantageous consequences for humans. Ideally, explainable Machine Learning should help detect such flaws in a model and aid a subsequent debugging process. One active line of research in Machine Learning explainability is gradient-based methods, which have been successfully applied to complex neural networks. Given that GPs are closed under differentiation, gradient-based explainability for GPs appears to be a promising field of research. This paper is primarily focused on explaining GP classifiers via gradients where, contrary to GP regression, derivative GPs are not straightforward to obtain.  ( 2 min )
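    For GP regression, where the abstract notes gradients are straightforward, the closed form is easy to state: the posterior mean is a weighted sum of kernels, so its input gradient is available analytically. The sketch below is our own, assumes an RBF kernel, and uses hypothetical names.

        # Closed-form input gradient of a GP regression posterior mean.
        import numpy as np

        def gp_mean_gradient(x, X, y, lengthscale=1.0, noise=1e-2):
            # Posterior mean: mu(x) = k(x, X) @ alpha, alpha = (K + s^2 I)^-1 y
            K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
                       / (2 * lengthscale**2))
            alpha = np.linalg.solve(K + noise * np.eye(len(X)), y)
            k = np.exp(-((x[None, :] - X) ** 2).sum(-1) / (2 * lengthscale**2))
            # d/dx k(x, x_i) = -(x - x_i) / l^2 * k(x, x_i) for the RBF kernel
            grad_k = -(x[None, :] - X) / lengthscale**2 * k[:, None]
            return grad_k.T @ alpha        # (d,) saliency vector at x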
    Augmentation-induced Consistency Regularization for Classification. (arXiv:2205.12461v1 [cs.LG])
    Deep neural networks have become popular in many supervised learning tasks, but they may suffer from overfitting when the training dataset is limited. To mitigate this, many researchers use data augmentation, which is a widely used and effective method for increasing the variety of datasets. However, the randomness introduced by data augmentation causes inevitable inconsistency between training and inference, which limits the achievable improvement. In this paper, we propose a consistency regularization framework based on data augmentation, called CR-Aug, which forces the output distributions of different sub-models generated by data augmentation to be consistent with each other. Specifically, CR-Aug evaluates the discrepancy between the output distributions of two augmented versions of each sample, and it utilizes a stop-gradient operation to minimize the consistency loss. We apply CR-Aug to image and audio classification tasks and conduct extensive experiments to verify its effectiveness in improving the generalization ability of classifiers. Our CR-Aug framework is ready to use and can be easily adapted to many state-of-the-art network architectures. Our empirical results show that CR-Aug outperforms baseline methods by a significant margin.  ( 2 min )
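    A minimal rendering of the consistency term as described, with the stop-gradient applied symmetrically to both branches; the KL-based divergence and all names are our assumptions rather than the authors' code.

        # Sketch of a stop-gradient consistency loss between two augmented views.
        import torch.nn.functional as F

        def cr_aug_loss(model, x_aug1, x_aug2):
            p1 = F.log_softmax(model(x_aug1), dim=-1)
            p2 = F.log_softmax(model(x_aug2), dim=-1)
            # Stop-gradient: each branch matches a detached copy of the other.
            return 0.5 * (F.kl_div(p1, p2.detach().exp(), reduction="batchmean")
                        + F.kl_div(p2, p1.detach().exp(), reduction="batchmean"))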
    Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit. (arXiv:1905.13654v11 [stat.ML] UPDATED)
    Recent work by Jacot et al. (2018) has shown that training a neural network using gradient descent in parameter space is related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model when the network width is large. Indeed, under regularity conditions, the NTK converges to a time-independent kernel in the infinite-width limit. This regime is often called the NTK regime. In parallel, recent works on signal propagation (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) studied the impact of the initialization and the activation function on signal propagation in deep neural networks. In this paper, we connect these two theories by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we provide a comprehensive analysis of the convergence rates of the NTK regime to the infinite depth regime.  ( 2 min )
    FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning. (arXiv:2205.07246v2 [cs.LG] UPDATED)
    Pseudo labeling and consistency regularization approaches based on confidence thresholding have made great progress in semi-supervised learning (SSL). However, we argue that existing methods might fail to adopt suitable thresholds since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to achieve some intuitions on the relationship between the desirable threshold and model's learning status. Based on the analysis, we hence propose FreeMatch to define and adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty that encourages the model to produce diverse predictions during the early stages of training. Extensive experimental results indicate the superiority of FreeMatch especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively.
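    A toy rendering of the self-adaptive idea: track the model's average confidence on unlabeled data with an EMA and use it as the pseudo-labeling threshold. This is a simplification of FreeMatch, which also maintains per-class thresholds; the class and parameter names are ours.

        # Simplified self-adaptive confidence threshold via an EMA.
        import torch

        class SelfAdaptiveThreshold:
            def __init__(self, num_classes, momentum=0.999):
                self.tau = torch.tensor(1.0 / num_classes)  # start uninformative
                self.m = momentum

            def update_and_mask(self, probs_unlabeled):
                conf, _ = probs_unlabeled.max(dim=-1)
                # Threshold follows the model's learning status.
                self.tau = self.m * self.tau + (1 - self.m) * conf.mean()
                return conf >= self.tau   # mask of confident pseudo-labels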
    A Neural Tangent Kernel Formula for Ensembles of Soft Trees with Arbitrary Architectures. (arXiv:2205.12904v1 [cs.LG])
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although it can have various tree architectures, the theoretical properties of their impact are not well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v1 [cs.LG])
    Learning causal structure poses a combinatorial search problem that typically involves evaluating structures using a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize the process of causal structure learning. Rather than searching over causal structures directly, we train a variational inference model to predict the causal structure from observational/interventional data. Our inference model acquires domain-specific inductive bias for causal discovery solely from data generated by a simulator. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Moreover, the architecture of our inference model is permutation invariant w.r.t. the data points and permutation equivariant w.r.t. the variables, facilitating generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities under substantial distribution shift and significantly outperform existing algorithms, especially in the challenging genomics domain.
    Skill Machines: Temporal Logic Composition in Reinforcement Learning. (arXiv:2205.12532v1 [cs.LG])
    A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines -- finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines that encode the solution to such tasks. We propose a framework where an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, such as linear temporal logics. This provides the agent with the ability to map from complex logical task specifications to near-optimal behaviours zero-shot. We demonstrate our approach in both a tabular and high-dimensional video game environment, where an agent is faced with several of these complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with regular offline reinforcement learning algorithms when optimal behaviours are desired.
    Sharpness-Aware Minimization with Dynamic Reweighting. (arXiv:2112.08772v3 [cs.LG] UPDATED)
    Deep neural networks are often overparameterized and may not easily achieve model generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm conducts adversarial weight perturbation, encouraging the model to converge to a flat minimum. SAM finds a common adversarial weight perturbation per batch. Although per-instance adversarial weight perturbations are stronger adversaries and can potentially lead to better generalization performance, their computational cost is very high, making it impractical to use per-instance perturbations efficiently in SAM. In this paper, we tackle this efficiency bottleneck and propose sharpness-aware minimization with dynamic reweighting ({\delta}-SAM). Our theoretical analysis motivates that it is possible to approach the stronger, per-instance adversarial weight perturbations using reweighted per-batch weight perturbations. {\delta}-SAM dynamically reweights the perturbation within each batch according to theoretically principled weighting factors, serving as a good approximation to per-instance perturbation. Experiments on various natural language understanding tasks demonstrate the effectiveness of {\delta}-SAM.
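    For reference, a compact sketch of the per-batch SAM perturbation that {\delta}-SAM builds on; the reweighting of examples inside the batch, which is the paper's contribution, is not shown, and the function names are our assumptions.

        # Sketch of one SAM step: perturb weights toward the batch's worst
        # point, take the gradient there, then restore the weights.
        import torch

        def sam_step(model, loss_fn, batch, rho=0.05):
            loss = loss_fn(model, batch)
            grads = torch.autograd.grad(loss, list(model.parameters()))
            norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p.add_(rho * g / (norm + 1e-12))   # climb to the worst point
            loss_adv = loss_fn(model, batch)
            loss_adv.backward()                        # gradients for the update
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p.sub_(rho * g / (norm + 1e-12))   # restore original weights
            return loss_adv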
    Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. (arXiv:2205.12685v1 [cs.CL])
    Despite the recent explosion of research interest, in-context learning and the precise impact of the quality of demonstrations remain elusive. While, based on the current literature, it is expected that in-context learning shares a similar mechanism to supervised learning, Min et al. (2022) recently reported that, surprisingly, input-label correspondence is less important than other aspects of prompt demonstrations. Inspired by this counter-intuitive observation, we re-examine the importance of ground-truth labels in in-context learning from diverse and statistical points of view. With the aid of the newly introduced metrics, i.e., Ground-truth Label Effect Ratio (GLER), demo-gain, and label sensitivity, we find that the impact of correct input-label matching can vary according to different configurations. Expanding upon the previous key finding on the role of demonstrations, the complementary and contrastive results suggest that one might need to take more care when estimating the impact of each component in in-context learning demonstrations.
    NeuralPDE: Modelling Dynamical Systems from Data. (arXiv:2111.07671v2 [cs.LG] UPDATED)
    Many physical processes such as weather phenomena or fluid mechanics are governed by partial differential equations (PDEs). Modelling such dynamical systems using Neural Networks is an active research field. However, current methods are still very limited, as they do not exploit the knowledge about the dynamical nature of the system, require extensive prior knowledge about the governing equations or are limited to linear or first-order equations. In this work we make the observation that the Method of Lines used to solve PDEs can be represented using convolutions which makes convolutional neural networks (CNNs) the natural choice to parametrize arbitrary PDE dynamics. We combine this parametrization with differentiable ODE solvers to form the NeuralPDE Model, which explicitly takes into account the fact that the data is governed by differential equations. We show in several experiments on toy and real-world data that our model consistently outperforms state-of-the-art models used to learn dynamical systems.
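    The stated construction is easy to sketch: a CNN parametrizes the right-hand side of the method-of-lines ODE, and a differentiable solver integrates it. The snippet below is our own minimal rendering and assumes the third-party torchdiffeq package purely for illustration.

        # Minimal sketch: CNN as the ODE right-hand side, integrated with a
        # differentiable solver (torchdiffeq assumed for illustration).
        import torch
        import torch.nn as nn
        from torchdiffeq import odeint

        class CNNDynamics(nn.Module):
            def __init__(self, channels=1, hidden=16):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv1d(channels, hidden, 3, padding=1), nn.Tanh(),
                    nn.Conv1d(hidden, channels, 3, padding=1))

            def forward(self, t, u):
                # du/dt = f_theta(u); the spatial convolution acts as a
                # learned finite-difference stencil (method of lines).
                return self.net(u)

        u0 = torch.randn(8, 1, 64)             # 1-D fields on a 64-point grid
        t = torch.linspace(0.0, 1.0, 10)
        u_traj = odeint(CNNDynamics(), u0, t)  # differentiable rollout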
    Autoformalization with Large Language Models. (arXiv:2205.12615v1 [cs.LG])
    Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs. A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence. While the long-term goal of autoformalization seemed elusive for a long time, we show that large language models provide new prospects towards this goal. We make the surprising observation that LLMs can correctly translate a significant portion ($25.3\%$) of mathematical competition problems to formal specifications in Isabelle/HOL. We demonstrate the usefulness of this process by improving a previously introduced neural theorem prover via training on these autoformalized theorems. Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6\%$ to $35.2\%$.
    Eliciting Transferability in Multi-task Learning with Task-level Mixture-of-Experts. (arXiv:2205.12701v1 [cs.CL])
    Recent work suggests that transformer models are capable of multi-task learning on diverse NLP tasks. However, the potential of these models may be limited as they use the same set of parameters for all tasks. In contrast, humans tackle tasks in a more flexible way, by making proper presumptions about what skills and knowledge are relevant and executing only the necessary computations. Inspired by this, we propose to use task-level mixture-of-experts models, which have a collection of transformer layers (i.e., experts) and a router component to choose among these experts dynamically and flexibly. We show that the learned routing decisions and experts partially rediscover human categorization of NLP tasks -- certain experts are strongly associated with extractive tasks, some with classification tasks, and some with tasks requiring world knowledge.
    muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems. (arXiv:2205.10937v2 [cs.LG] UPDATED)
    Most uses of machine learning today involve training a model from scratch for a particular task, or sometimes starting with a model pretrained on a related task and then fine-tuning on a downstream task. Both approaches offer limited knowledge transfer between different tasks, time-consuming human-driven customization to individual tasks and high computational costs, especially when starting from randomly initialized models. We propose a method that uses the layers of a pretrained deep neural network as building blocks to construct an ML system that can jointly solve an arbitrary number of tasks. The resulting system can leverage cross-task knowledge transfer while being immune from common drawbacks of multitask approaches such as catastrophic forgetting, gradient interference and negative transfer. We define an evolutionary approach designed to jointly select the prior knowledge relevant for each task, choose the subset of the model parameters to train and dynamically auto-tune its hyperparameters. Furthermore, a novel scale control method is employed to achieve quality/size trade-offs that outperform common fine-tuning techniques. Compared with standard fine-tuning on a benchmark of 10 diverse image classification tasks, the proposed model improves the average accuracy by 2.39% while using 47% fewer parameters per task.
    Tell me why! Explanations support learning relational and causal structure. (arXiv:2112.03753v3 [cs.LG] UPDATED)
    Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
    Removing the fat from your posterior samples with margarine. (arXiv:2205.12841v1 [astro-ph.IM])
    Bayesian workflows often require the introduction of nuisance parameters, yet for core science modelling one needs access to a marginal posterior density. In this work we use masked autoregressive flows and kernel density estimators to encapsulate the marginal posterior, allowing us to compute marginal Kullback-Leibler divergences and marginal Bayesian model dimensionalities in addition to generating samples and computing marginal log probabilities. We demonstrate this in application to topical cosmological examples of the Dark Energy Survey, and global 21cm signal experiments. In addition to the computation of marginal Bayesian statistics, this work is important for further applications in Bayesian experimental design, complex prior modelling and likelihood emulation. This technique is made publicly available in the pip-installable code margarine.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v1 [cs.LG])
    We propose a recipe on how to build a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications, but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being $\textit{local}$, $\textit{global}$ or $\textit{relative}$. Further, GTs remain constrained to small graphs with a few hundred nodes, and we propose the first architecture with a complexity linear in the number of nodes and edges $O(N+E)$ by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect the expressivity, with our architecture being a universal function approximator for graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We build and open-source a modular framework $\textit{GraphGPS}$ that supports multiple types of encodings and that provides efficiency and scalability both in small and large graphs. We test our architecture on 11 benchmarks and show very competitive results on all of them, showcasing the empirical benefits gained by the modularity and the combination of different strategies.
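    A schematic GPS-style layer, under our own heavy simplifications: one local message-passing pass over real edges plus one global attention pass over all nodes, combined additively. Real GraphGPS layers add positional/structural encodings, normalization, dropout, and MLPs; the module and variable names here are assumptions.

        # Schematic decoupled layer: local aggregation + global attention.
        import torch
        import torch.nn as nn

        class GPSLayer(nn.Module):
            def __init__(self, dim, heads=4):   # dim must be divisible by heads
                super().__init__()
                self.local = nn.Linear(2 * dim, dim)   # toy message function
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, x, edge_index):
                src, dst = edge_index                  # (2, E) real edges only
                msgs = self.local(torch.cat([x[src], x[dst]], dim=-1))
                local = torch.zeros_like(x).index_add_(0, dst, msgs)
                global_, _ = self.attn(x[None], x[None], x[None])  # all pairs
                return x + local + global_[0]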
    Training Heterogeneous Features in Sequence to Sequence Tasks: Latent Enhanced Multi-filter Seq2Seq Model. (arXiv:2105.08840v3 [cs.CL] UPDATED)
    In language processing, training data with extremely large variance may lead to difficulty in the language model's convergence. It is difficult for the network parameters to adapt to sentences with largely varied semantics or grammatical structures. To resolve this problem, we introduce a model that concentrates on each of the heterogeneous features in the input sentences. Building upon the encoder-decoder architecture, we design a latent-enhanced multi-filter seq2seq model (LEMS) that analyzes the input representations by introducing a latent space transformation and clustering. The representations are extracted from the final hidden state of the encoder and lie in the latent space. A latent space transformation is applied to enhance the quality of the representations, so that the clustering algorithm can easily separate samples based on the features of these representations. Multiple filters are trained by the features from their corresponding clusters, and the heterogeneity of the training data can be resolved accordingly. We conduct two sets of comparative experiments on semantic parsing and machine translation, using the Geo-query dataset and Multi30k English-French respectively, to demonstrate the enhancement our model provides.
    RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning. (arXiv:2205.12598v1 [cs.CL])
    Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in English natural language. While the progress is promising, it is currently unclear if these models indeed perform logical reasoning by understanding the underlying logical semantics in the language. To this end, we propose RobustLR, a suite of evaluation datasets that evaluate the robustness of these models to minimal logical edits in rulebases and some standard logical equivalence conditions. In our experiments with RoBERTa and T5, we find that the models trained in prior works do not perform consistently on the different perturbations in RobustLR, thus showing that the models are not robust to the proposed logical perturbations. Further, we find that the models find it especially hard to learn logical negation and disjunction operators. Overall, using our evaluation sets, we demonstrate some shortcomings of the deductive reasoning-based language models, which can eventually help towards designing better models for logical reasoning over natural language.
    Boosting Tail Neural Network for Realtime Custom Keyword Spotting. (arXiv:2205.12933v1 [eess.AS])
    In this paper, we propose a Boosting Tail Neural Network (BTNN) for improving the performance of Realtime Custom Keyword Spotting (RCKS), which remains an industrial challenge because it demands powerful classification ability with limited computation resources. The design is inspired by brain science, where a brain is only partly activated by a given nerve stimulus, and by the many machine learning algorithms that use a batch of weak classifiers to resolve arduous problems, an approach often proved to be effective. We show that this method is helpful for the RCKS problem. The proposed approach achieves better performance in terms of wakeup rate and false alarms; in our experiments, it obtains an 18\% relative improvement over traditional algorithms that use only one strong classifier. We also point out that this approach may be promising for future ASR exploration.  ( 2 min )
    Learning Mean Field Games: A Survey. (arXiv:2205.12944v1 [cs.LG])
    Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malhamé, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with a full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising for solving complex problems. By combining MFGs and RL, we hope to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.
    Optimizing Warfarin Dosing using Deep Reinforcement Learning. (arXiv:2202.03486v2 [cs.LG] UPDATED)
    Warfarin is a widely used anticoagulant, and has a narrow therapeutic range. Dosing of warfarin should be individualized, since slight overdosing or underdosing can have catastrophic or even fatal consequences. Despite much research on warfarin dosing, current dosing protocols do not live up to expectations, especially for patients sensitive to warfarin. We propose a deep reinforcement learning-based dosing model for warfarin. To overcome the issue of relatively small sample sizes in dosing trials, we use a Pharmacokinetic/ Pharmacodynamic (PK/PD) model of warfarin to simulate dose-responses of virtual patients. Applying the proposed algorithm on virtual test patients shows that this model outperforms a set of clinically accepted dosing protocols by a wide margin. We tested the robustness of our dosing protocol on a second PK/PD model and showed that its performance is comparable to the set of baseline protocols.
    Understanding Programmatic Weak Supervision via Source-aware Influence Function. (arXiv:2205.12879v1 [cs.LG])
    Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. To achieve this, we build on the Influence Function (IF) and propose the source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as the source vote, supervision source, and training data. On datasets of diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles that reveals insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).
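    As background, a first-order sketch of an influence score; the paper's source-aware IF further decomposes such scores by (data, source, class) tuple. We use the common identity-Hessian simplification, so the score reduces to a gradient dot product, and the function names are ours.

        # Gradient-dot influence heuristic (identity-Hessian simplification).
        import torch

        def influence_on_test(model, loss_fn, train_example, test_example):
            g_tr = torch.autograd.grad(loss_fn(model, train_example),
                                       list(model.parameters()))
            g_te = torch.autograd.grad(loss_fn(model, test_example),
                                       list(model.parameters()))
            # IF(z_tr, z_te) ~ -grad_L(z_te)^T H^-1 grad_L(z_tr);
            # with H ~ I this is just a (negated) dot product.
            return -sum((a * b).sum() for a, b in zip(g_tr, g_te))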
    Learning Distributions by Generative Adversarial Networks: Approximation and Generalization. (arXiv:2205.12601v1 [cs.LG])
    We study how well generative adversarial networks (GAN) learn probability distributions from finite samples by analyzing the convergence rates of these models. Our analysis is based on a new oracle inequality that decomposes the estimation error of GAN into the discriminator and generator approximation errors, generalization error and optimization error. To estimate the discriminator approximation error, we establish error bounds on approximating Hölder functions by ReLU neural networks, with explicit upper bounds on the Lipschitz constant of the network or norm constraint on the weights. For the generator approximation error, we show that a neural network can approximately transform a low-dimensional source distribution to a high-dimensional target distribution and bound such approximation error by the width and depth of the neural network. Combining the approximation results with generalization bounds of neural networks from statistical learning theory, we establish the convergence rates of GANs in various settings, when the error is measured by a collection of integral probability metrics defined through Hölder classes, including the Wasserstein distance as a special case. In particular, for distributions concentrated around a low-dimensional set, we show that the convergence rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension.
    Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation. (arXiv:2109.06604v2 [cs.CL] UPDATED)
    Recently, $k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its capability on unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representation of this task to the ideal representation of translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves the translation accuracy with target-side monolingual data, while achieving comparable performance with back-translation.
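    A bare-bones view of the token-level datastore that $k$NN-MT retrieves from; in the paper, keys would come from adapter-mapped decoder states on target-side monolingual data, whereas here the arrays, the brute-force squared-Euclidean search, and all names are illustrative assumptions.

        # Sketch: build a (key, value) datastore and retrieve kNN token probs.
        import numpy as np

        def build_datastore(hidden_states, next_tokens):
            return (np.asarray(hidden_states, dtype=np.float32),
                    np.asarray(next_tokens))

        def knn_token_probs(query, keys, values, vocab_size, k=8, temperature=10.0):
            d2 = ((keys - query[None, :]) ** 2).sum(-1)
            idx = np.argsort(d2)[:k]                  # k nearest keys
            w = np.exp(-d2[idx] / temperature)
            probs = np.zeros(vocab_size)
            np.add.at(probs, values[idx], w / w.sum())  # aggregate neighbor votes
            return probs   # typically interpolated with the NMT distribution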
    From Noisy Prediction to True Label: Noisy Prediction Calibration via Generative Model. (arXiv:2205.00690v2 [cs.LG] UPDATED)
    Noisy labels are inevitable yet problematic in the machine learning community. They ruin the generalization power of a classifier by causing it to be overfitted to wrong labels. Existing methods for noisy labels have focused on modifying the classifier training procedure, which leads to two possible problems. First, these methods are not applicable to a pre-trained classifier without further access to training. Second, it is not easy to train a classifier and simultaneously remove all of the negative effects of noisy labels. Given these problems, we suggest a new branch of approach, Noisy Prediction Calibration (NPC), for learning with noisy labels. Through the introduction and estimation of a new type of transition matrix via a generative model, NPC corrects the noisy prediction from the pre-trained classifier to the true label as a post-processing scheme. We prove that NPC theoretically aligns with the transition-matrix-based methods. Yet, NPC provides a more accurate pathway to estimate the true label, even without involvement in classifier learning. Also, NPC is applicable to any classifier trained with noisy-label methods, as long as training instances and their predictions are available. Our method, NPC, boosts the classification performance of all baseline models on both synthetic and real-world datasets.
    Certified Robustness Against Natural Language Attacks by Causal Intervention. (arXiv:2205.12331v1 [cs.LG])
    Deep learning models have achieved great success in many fields, yet they are vulnerable to adversarial examples. This paper follows a causal perspective to look into the adversarial vulnerability and proposes Causal Intervention by Semantic Smoothing (CISS), a novel framework towards robustness against natural language attacks. Instead of merely fitting observational data, CISS learns causal effects p(y|do(x)) by smoothing in the latent semantic space to make robust predictions, which scales to deep architectures and avoids tedious construction of noise customized for specific attacks. CISS is provably robust against word substitution attacks, as well as empirically robust even when perturbations are strengthened by unknown attack algorithms. For example, on YELP, CISS surpasses the runner-up by 6.7% in terms of certified robustness against word substitutions, and achieves 79.4% empirical robustness when syntactic attacks are integrated.
    Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently. (arXiv:2205.12808v1 [cs.LG])
    Driven by the empirical success and wide use of deep neural networks, understanding the generalization performance of overparameterized models has become an increasingly popular question. To this end, there has been substantial effort to characterize the implicit bias of the optimization algorithms used, such as gradient descent (GD), and the structural properties of their preferred solutions. This paper answers an open question in this literature: For the classification setting, what solution does mirror descent (MD) converge to? Specifically, motivated by its efficient implementation, we consider the family of mirror descent algorithms with potential function chosen as the $p$-th power of the $\ell_p$-norm, which is an important generalization of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we characterize the solutions it obtains and show that it converges in direction to a generalized maximum-margin solution with respect to the $\ell_p$-norm for linearly separable classification. While the MD update rule is in general expensive to compute and perhaps not suitable for deep learning, $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks with virtually no additional computational overhead. Using comprehensive experiments with both linear and deep neural network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
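    The $p$-$\textsf{GD}$ update is simple to sketch because the mirror map of the potential $\frac{1}{p}\|w\|_p^p$ and its inverse are coordinate-wise, which is why the update parallelizes like SGD. The following is our own minimal rendering, with hypothetical names.

        # Mirror descent step with potential (1/p)||w||_p^p (p-GD).
        import torch

        def p_gd_step(w, grad, lr, p=3.0):
            q = p - 1.0
            z = torch.sign(w) * w.abs().pow(q)   # mirror map (coordinate-wise)
            z = z - lr * grad                    # gradient step in dual space
            return torch.sign(z) * z.abs().pow(1.0 / q)  # inverse mirror map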
    Graph Neural Networks Designed for Different Graph Types: A Survey. (arXiv:2204.03080v2 [cs.LG] UPDATED)
    Graphs are ubiquitous in nature and can therefore serve as models for many practical but also theoretical problems. Based on this, the young research field of Graph Neural Networks (GNNs) has emerged. Despite the youth of the field and the speed at which new models are developed, many good surveys have been published in recent years. Nevertheless, an overview of which graph types can be modeled by GNNs is missing. In this survey, we give a detailed overview of already existing GNNs and, unlike previous surveys, categorize them according to their ability to handle different graph types and properties. We consider GNNs operating on static as well as on dynamic graphs of different structural constitutions, with or without node or edge attributes. Moreover, in the dynamic case, we separate the models into discrete-time and continuous-time dynamic graphs based on their architecture. While ordering the existing GNN models, we find that there are still graph types that are not, or only rarely, covered by existing GNN models. We point out where models are missing and give potential reasons for their absence.
    RADNet: Ensemble Model for Robust Glaucoma Classification in Color Fundus Images. (arXiv:2205.12902v1 [eess.IV])
    Glaucoma is one of the most severe eye diseases, characterized by rapid progression and leading to irreversible blindness. It is often the case that pathology diagnostics is carried out when one's sight has already significantly degraded due to the lack of noticeable symptoms at the early stage of the disease. Regular glaucoma screenings of the population would improve early-stage detection; however, the desirable frequency of ophthalmological checkups is often not feasible due to the excessive load that manual diagnostics imposes on a limited number of specialists. Considering that the basic methodology for detecting glaucoma is to analyze fundus images for the \textit{optic-disc-to-optic-cup ratio}, the Machine Learning domain can offer sophisticated tooling for image processing and classification. In our work, we propose an advanced image pre-processing technique combined with an ensemble of deep classification networks. Our \textit{Retinal Auto Detection (RADNet)} model has been successfully tested on the Rotterdam EyePACS AIROGS training dataset with an AUC of 0.92, and then additionally finetuned and tested on a fraction of the RIM-ONE DL dataset with an AUC of 0.91.
    Service Discovery in Social Internet of Things using Graph Neural Networks. (arXiv:2205.12711v1 [cs.LG])
    Internet-of-Things (IoT) networks intelligently connect thousands of physical entities to provide various services for the community. They are witnessing an exponential expansion, which is complicating the process of discovering IoT devices existing in the network and requesting corresponding services from them. As the highly dynamic nature of the IoT environment hinders the use of traditional service discovery solutions, we aim, in this paper, to address this issue by proposing a scalable resource allocation neural model adequate for heterogeneous large-scale IoT networks. We devise a Graph Neural Network (GNN) approach that utilizes the social relationships formed between the devices in the IoT network to reduce the search space of any entity lookup and acquire a service from another device in the network. This proposed resource allocation approach overcomes standardization issues and embeds the structure and characteristics of the social IoT graph, by means of GNNs, for the eventual clustering analysis process. Simulation results applied on a real-world dataset illustrate the performance of this solution and its significant efficiency when operating on large-scale IoT networks.
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v4 [cs.LG] UPDATED)
    Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v1 [cs.LG])
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume certain amount of resource from each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Conditional Gradients for the Approximately Vanishing Ideal. (arXiv:2202.03349v8 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI) for the construction of the set of generators of the approximately vanishing ideal. The constructed set of generators captures polynomial structures in data and gives rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In CGAVI, we construct the set of generators by solving specific instances of (constrained) convex optimization problems with the Pairwise Frank-Wolfe algorithm (PFW). Among other things, the constructed generators inherit the LASSO generalization bound and not only vanish on the training but also on out-sample data. Moreover, CGAVI admits a compact representation of the approximately vanishing ideal by constructing few generators with sparse coefficient vectors.
    Women, artificial intelligence, and key positions in collaboration networks: Towards a more equal scientific ecosystem. (arXiv:2205.12339v1 [cs.SI])
    Scientific collaboration in almost every discipline is mainly driven by the need to share knowledge, expertise, and pooled resources. Science is becoming more complex, which has encouraged scientists to become more involved in collaborative research projects in order to better address the challenges. As a highly interdisciplinary field with a rapidly evolving scientific landscape, artificial intelligence calls for researchers with special profiles covering a diverse set of skills and expertise. Understanding the gender aspects of scientific collaboration is of paramount importance, especially in a field such as artificial intelligence that has been attracting large investments. Using social network analysis, natural language processing, and machine learning, and focusing on artificial intelligence publications for the period from 2000 to 2019, in this work we comprehensively investigated the effects of several driving factors on acquiring key positions in scientific collaboration networks through a gender lens. It was found that, regardless of gender, scientific performance in terms of quantity and impact plays a crucial role in attaining the "social researcher" position in the network. However, subtle differences were observed between female and male researchers in acquiring the "local influencer" role.
    A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning. (arXiv:2201.01221v2 [cs.LG] UPDATED)
    Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.
    Fast Stochastic Composite Minimization and an Accelerated Frank-Wolfe Algorithm under Parallelization. (arXiv:2205.12751v1 [math.OC])
    We consider the problem of minimizing the sum of two convex functions. One of those functions has Lipschitz-continuous gradients, and can be accessed via stochastic oracles, whereas the other is "simple". We provide a Bregman-type algorithm with accelerated convergence in function values to a ball containing the minimum. The radius of this ball depends on problem-dependent constants, including the variance of the stochastic oracle. We further show that this algorithmic setup naturally leads to a variant of Frank-Wolfe achieving acceleration under parallelization. More precisely, when minimizing a smooth convex function on a bounded domain, we show that one can achieve an $\epsilon$ primal-dual gap (in expectation) in $\tilde{O}(1/ \sqrt{\epsilon})$ iterations, by only accessing gradients of the original function and a linear maximization oracle with $O(1/\sqrt{\epsilon})$ computing units in parallel. We illustrate this fast convergence on synthetic numerical experiments.
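    To fix notation, a generic Frank-Wolfe iteration with the standard open-loop step size; the paper's accelerated variant replaces the single gradient evaluation with parallel stochastic ones. The lmo argument stands for an assumed linear maximization oracle over the feasible set, and all names are ours.

        # Generic Frank-Wolfe on a bounded domain via a linear oracle.
        import numpy as np

        def frank_wolfe(grad_f, lmo, x0, iters=100):
            x = x0.copy()
            for t in range(iters):
                s = lmo(-grad_f(x))            # argmax_{s in C} <-grad, s>
                gamma = 2.0 / (t + 2.0)        # standard open-loop step size
                x = (1 - gamma) * x + gamma * s
            return x

        # Example: minimize ||x - c||^2 over the l1 ball via its vertex oracle.
        c = np.array([0.3, -0.8, 0.1])
        lmo_l1 = lambda g: (np.eye(3)[np.argmax(np.abs(g))]
                            * np.sign(g[np.argmax(np.abs(g))]))
        x_star = frank_wolfe(lambda x: 2 * (x - c), lmo_l1, np.zeros(3))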
    Is a Question Decomposition Unit All We Need?. (arXiv:2205.12538v1 [cs.CL])
    Large Language Models (LMs) have achieved state-of-the-art performance on many Natural Language Processing (NLP) benchmarks. With the growing number of new benchmarks, we build bigger and more complex LMs. However, building new LMs may not be an ideal option owing to the cost, time and environmental impact associated with it. We explore an alternative route: can we modify data by expressing it in terms of the model's strengths, so that a question becomes easier for models to answer? We investigate if humans can decompose a hard question into a set of simpler questions that are relatively easier for models to solve. We analyze a range of datasets involving various forms of reasoning and find that it is indeed possible to significantly improve model performance (24% for GPT3 and 29% for RoBERTa-SQuAD along with a symbolic calculator) via decomposition. Our approach provides a viable option to involve people in NLP research in a meaningful way. Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v2 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v1 [stat.ML])
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach to obtaining exponential concentration inequalities for dynamical systems within an entirely functional-analytic framework. We also show that the existence of an exponential-type Lyapunov function, in contrast to the purely deterministic setting, not only implies stability but also yields exponential concentration inequalities for sampling from the stationary distribution, via a transport-entropy (T-E) inequality. These results have significant implications for reinforcement learning (RL) and controls, leading to exponential concentration inequalities even for unbounded observables, while assuming neither reversibility nor exact knowledge of the random dynamical system (assumptions at the heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity. (arXiv:2104.13790v3 [cs.LG] UPDATED)
    AdaBelief, one of the current best optimizers, demonstrates superior generalization ability over the popular Adam algorithm by adapting its step sizes according to the deviation of observed gradients from their exponential moving average. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains, however, an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that exploits strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size to better account for strong convexity and prevent fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependent $O(\log T)$ regret bound, which is substantially lower than that of AdaBelief. On the empirical side, we validate our theoretical analysis with extensive experiments, under both strong and non-strong convexity, on three popular baseline models. The experimental results are very encouraging: FastAdaBelief converges fastest among all mainstream algorithms while maintaining excellent generalization ability, under both strong and non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.
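    A minimal NumPy sketch of the AdaBelief-style update that FastAdaBelief builds on: the second-moment estimate tracks the deviation of the gradient from its exponential moving average rather than the raw squared gradient. The stage-wise step-size schedule FastAdaBelief uses to exploit strong convexity is omitted, and all names below are illustrative.

```python
import numpy as np

def adabelief_step(x, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaBelief-style update on parameters x given gradient g."""
    m = b1 * m + (1 - b1) * g                  # EMA of gradients
    s = b2 * s + (1 - b2) * (g - m) ** 2       # EMA of the "belief" (deviation)
    m_hat = m / (1 - b1 ** t)                  # bias corrections
    s_hat = s / (1 - b2 ** t)
    x = x - lr * m_hat / (np.sqrt(s_hat) + eps)
    return x, m, s

# Usage on a toy quadratic f(x) = ||x||^2 / 2, whose gradient at x is x itself:
x, m, s = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 1001):
    x, m, s = adabelief_step(x, x.copy(), m, s, t, lr=1e-2)
```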
    Detecting Multi-Sensor Fusion Errors in Advanced Driver-Assistance Systems. (arXiv:2109.06404v3 [cs.RO] UPDATED)
    Advanced Driver-Assistance Systems (ADAS) have been thriving and widely deployed in recent years. In general, these systems receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor outputs, they usually leverage multi-sensor fusion (MSF) to fuse the sensor outputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data and how to optimally integrate the data provided by the sensors. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular MSF methods in an industry-grade ADAS can mislead the car control and result in serious safety hazards. We define the failures (e.g., car crashes) caused by the faulty MSF as fusion errors and develop a novel evolutionary-based domain-specific search framework, FusED, for the efficient detection of fusion errors. We further apply causality analysis to show that the found fusion errors are indeed caused by the MSF method. We evaluate our framework on two widely used MSF methods in two driving environments. Experimental results show that FusED identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods we study.
    Label Leakage and Protection from Forward Embedding in Vertical Federated Learning. (arXiv:2203.01451v3 [cs.LG] UPDATED)
    Vertical federated learning (vFL) has gained much attention and been deployed to solve machine learning problems with data privacy concerns in recent years. However, recent work has demonstrated that vFL is vulnerable to privacy leakage even though only the forward intermediate embedding (rather than raw features) and backpropagated gradients (rather than raw labels) are communicated between the participants. As the raw labels often contain highly sensitive information, several recent methods have been proposed to effectively prevent label leakage from the backpropagated gradients in vFL. However, these works only identified and defended against the threat of label leakage from the backpropagated gradients; none of them has paid attention to label leakage from the intermediate embedding. In this paper, we propose a practical label inference method which can steal private labels effectively from the shared intermediate embedding even when existing protection methods such as label differential privacy and gradient perturbation are applied. The effectiveness of the label attack is inseparable from the correlation between the intermediate embedding and the corresponding private labels. To mitigate label leakage from the forward embedding, we add an additional optimization goal at the label party that limits the adversary's label-stealing ability by minimizing the distance correlation between the intermediate embedding and the corresponding private labels. We conduct extensive experiments to demonstrate the effectiveness of our proposed protection methods.
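    The protection objective hinges on distance correlation between embeddings and labels. Below is a sketch of the empirical statistic (Szekely et al., 2007); a differentiable (e.g., PyTorch) version would be needed to use it as a training penalty, and all names here are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _double_center(D):
    return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

def distance_correlation(X, Y):
    """Empirical distance correlation between two samples of n observations."""
    A = _double_center(squareform(pdist(X.reshape(len(X), -1))))
    B = _double_center(squareform(pdist(Y.reshape(len(Y), -1))))
    dcov2 = max((A * B).mean(), 0.0)                       # clip tiny negatives
    denom = ((A * A).mean() * (B * B).mean()) ** 0.25      # (dVarX * dVarY)^(1/4)
    return np.sqrt(dcov2) / denom if denom > 0 else 0.0

# Noticeably high when embeddings leak labels, small when independent:
y = np.random.randint(0, 2, size=200)
leaky = y[:, None] + 0.1 * np.random.randn(200, 8)
print(distance_correlation(leaky, y))                  # high
print(distance_correlation(np.random.randn(200, 8), y))  # small
```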
    Interpretable Feature Engineering for Time Series Predictors using Attention Networks. (arXiv:2205.12723v1 [cs.LG])
    Regression problems with time-series predictors are common in banking and many other areas of application. In this paper, we use multi-head attention networks to develop interpretable features and use them to achieve good predictive performance. The customized attention layer explicitly uses multiplicative interactions and builds feature-engineering heads that capture temporal dynamics in a parsimonious manner. Convolutional layers are used to combine multivariate time series. We also discuss methods for handling static covariates in the modeling process. Visualization and explanation tools are used to interpret the results and explain the relationship between the inputs and the extracted features. Both simulated and real datasets are used to illustrate the usefulness of the methodology. Keywords: Attention heads, Deep neural networks, Interpretable feature engineering
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v2 [stat.ML] UPDATED)
    We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Structured Uncertainty in the Observation Space of Variational Autoencoders. (arXiv:2205.12533v1 [cs.LG])
    Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically meaningful variations from the mean allowing the prediction of multiple plausible outputs with a single forward pass.
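    A sketch of the sampling step such a low-rank observation model affords: if the decoder outputs a mean mu, a low-rank factor L and per-pixel scales sigma, spatially correlated samples from N(mu, LL^T + diag(sigma^2)) can be drawn without ever forming the full covariance. Shapes and names below are illustrative assumptions, not the paper's exact parameterisation.

```python
import torch

def sample_lowrank_gaussian(mu, L, log_sigma, n_samples=1):
    """Sample from N(mu, L @ L.T + diag(sigma^2)) for a d-pixel image.

    mu: (d,) mean image, L: (d, k) low-rank factor, log_sigma: (d,) log scales.
    Cost is O(d*k) per sample instead of O(d^2)."""
    d, k = L.shape
    z = torch.randn(n_samples, k)      # latent noise shared across pixels
    eps = torch.randn(n_samples, d)    # independent per-pixel noise
    return mu + z @ L.T + torch.exp(log_sigma) * eps

# A 32x32 grayscale image flattened to d = 1024 pixels, with rank k = 16:
mu = torch.zeros(1024)
L = 0.1 * torch.randn(1024, 16)
log_sigma = torch.full((1024,), -2.0)
samples = sample_lowrank_gaussian(mu, L, log_sigma, n_samples=4)  # (4, 1024)
```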
    Global geomagnetic perturbation forecasting using Deep Learning. (arXiv:2205.12734v1 [physics.space-ph])
    Geomagnetically Induced Currents (GICs) arise from spatio-temporal changes to Earth's magnetic field driven by the interaction of the solar wind with Earth's magnetosphere, and can cause catastrophic damage to our technologically dependent society. Hence, computational models that forecast GICs globally with a large forecast horizon, high spatial resolution and high temporal cadence are of increasing importance for prompt mitigation. Since GIC data is proprietary, the time variability of the horizontal component of the magnetic field perturbation (dB/dt) is used as a proxy for GICs. In this work, we develop a fast, global dB/dt forecasting model, which forecasts 30 minutes into the future using only solar wind measurements as input. The model summarizes 2 hours of solar wind measurements using a Gated Recurrent Unit, and generates forecasts of coefficients which are folded with a spherical harmonic basis to enable global forecasts. When deployed, our model produces results in under a second, and generates global forecasts of the horizontal magnetic perturbation components at 1-minute cadence. We evaluate our model against models in the literature for two specific storms of 5 August 2011 and 17 March 2015, using a self-consistent benchmark model set. Our model matches or outperforms state-of-the-practice high-time-cadence local and low-time-cadence global models, as well as the benchmark models. Such quick inference at high temporal cadence and arbitrary spatial resolution may ultimately enable accurate forewarning of dB/dt for any place on Earth, allowing precautionary measures to be taken in an informed manner.
    Label Leakage and Protection in Two-party Split Learning. (arXiv:2102.08504v3 [cs.LG] UPDATED)
    Two-party split learning is a popular technique for learning a model across feature-partitioned data. In this work, we explore whether it is possible for one party to steal the private label information from the other party during split training, and whether there are methods that can protect against such attacks. Specifically, we first formulate a realistic threat model and propose a privacy loss metric to quantify label leakage in split learning. We then show that there exist two simple yet effective methods within the threat model that can allow one party to accurately recover private ground-truth labels owned by the other party. To combat these attacks, we propose several random perturbation techniques, including $\texttt{Marvell}$, an approach that strategically finds the structure of the noise perturbation by minimizing the amount of label leakage (measured through our quantification metric) of a worst-case adversary. We empirically demonstrate the effectiveness of our protection techniques against the identified attacks, and show that $\texttt{Marvell}$ in particular has improved privacy-utility tradeoffs relative to baseline approaches.
    Domain Adaptation via Maximizing Surrogate Mutual Information. (arXiv:2110.12184v2 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims to predict unlabeled data from the target domain with access to labeled data from the source domain. In this work, we propose a novel framework called SIDA (Surrogate Mutual Information Maximization Domain Adaptation) with strong theoretical guarantees. To be specific, SIDA implements adaptation by maximizing mutual information (MI) between features. In the framework, a surrogate joint distribution models the underlying joint distribution of the unlabeled target domain. Our theoretical analysis validates SIDA by bounding the expected risk on the target domain in terms of MI and the surrogate distribution bias. Experiments show that our approach is comparable with state-of-the-art unsupervised adaptation methods on standard UDA tasks.
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v3 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework targets the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where exact inference of latent variables is typically intractable. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space often lacks physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for enhancing and generalizing the predictive capabilities of deep learning-based models.
    Reward Uncertainty for Exploration in Preference-based Reinforcement Learning. (arXiv:2205.12401v1 [cs.LG])
    Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model based on human preferences by actively incorporating human feedback, i.e. a teacher's preferences between two clips of behaviors. However, poor feedback-efficiency remains a problem in current preference-based RL algorithms, as tailored human feedback is very expensive. To handle this issue, previous methods have mainly focused on improving query selection and policy initialization. At the same time, recent exploration methods have proven to be a recipe for improving sample-efficiency in RL. We present an exploration method specifically for preference-based RL algorithms. Our main idea is to design an intrinsic reward by measuring novelty based on the learned reward. Specifically, we utilize disagreement across an ensemble of learned reward models. Our intuition is that disagreement among the learned reward models reflects uncertainty in the tailored human feedback and can be useful for exploration. Our experiments show that an exploration bonus from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms on complex robot manipulation tasks from the MetaWorld benchmarks, compared with other existing exploration methods that measure the novelty of state visitation.
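    A minimal PyTorch sketch of the core idea: train an ensemble of reward models on the same preference data and use the standard deviation of their predictions as an intrinsic exploration bonus. The architecture, ensemble size, and the bonus scale beta are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    def __init__(self, obs_dim, n_models=5, hidden=64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_models))

    def forward(self, obs):
        # Stack each member's scalar reward prediction: shape (E, B)
        return torch.stack([m(obs).squeeze(-1) for m in self.members])

    def reward_with_bonus(self, obs, beta=1.0):
        preds = self(obs)
        # Mean prediction as the reward estimate; disagreement (std across
        # ensemble members) as the intrinsic exploration bonus.
        return preds.mean(0) + beta * preds.std(0)

ensemble = RewardEnsemble(obs_dim=17)
r = ensemble.reward_with_bonus(torch.randn(32, 17))  # (32,)
```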
    Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT. (arXiv:2205.12399v1 [cs.LG])
    We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. The Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms (<0.2%) BERT on SuperGLUE, but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and model hyperparameters. The Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v1 [cs.LG])
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance for exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier, utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if we choose Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve constant regret for risk-averse users, independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to be compromised for the success of $\pi^{\text{E}}$.
    GuardNN: Secure Accelerator Architecture for Privacy-Preserving Deep Learning. (arXiv:2008.11632v2 [cs.CR] UPDATED)
    This paper proposes GuardNN, a secure DNN accelerator that provides hardware-based protection for user data and model parameters even in an untrusted environment. GuardNN shows that the architecture and protection can be customized for a specific application to provide strong confidentiality and integrity guarantees with negligible overhead. The design of the GuardNN instruction set reduces the TCB to just the accelerator and allows confidentiality protection even when the instructions from a host cannot be trusted. GuardNN minimizes the overhead of memory encryption and integrity verification by customizing the off-chip memory protection for the known memory access patterns of a DNN accelerator. GuardNN is prototyped on an FPGA, demonstrating effective confidentiality protection with ~3% performance overhead for inference.
    On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity. (arXiv:2205.12642v1 [stat.ML])
    Most complex machine learning and modelling techniques are prone to over-fitting and may subsequently generalise poorly to future data. Artificial neural networks are no different in this regard and, despite having a level of implicit regularisation when trained with gradient descent, often require the aid of explicit regularisers. We introduce a new framework, Model Gradient Similarity (MGS), that (1) serves as a metric of regularisation, which can be used to monitor neural network training, (2) adds insight into how explicit regularisers, while derived from widely different principles, operate via the same mechanism underneath by increasing MGS, and (3) provides the basis for a new regularisation scheme which exhibits excellent performance, especially in challenging settings such as high levels of label noise or limited sample sizes.
    Convolutional Neural Processes for Inpainting Satellite Images. (arXiv:2205.12407v1 [cs.CV])
    The widespread availability of satellite images has allowed researchers to model complex systems such as disease dynamics. However, many satellite images have missing values due to measurement defects, which render them unusable without data imputation. For example, the scanline corrector for the LANDSAT 7 satellite broke down in 2003, resulting in a loss of around 20% of its data. Inpainting involves predicting what is missing based on the known pixels and is an old problem in image processing, classically based on PDEs or interpolation methods, though recent deep learning approaches have shown promise. However, many of these methods do not explicitly take into account the inherent spatiotemporal structure of satellite images. In this work, we cast satellite image inpainting as a natural meta-learning problem, and propose using convolutional neural processes (ConvNPs), framing each satellite image as its own task or 2D regression problem. We show that ConvNPs can outperform classical methods and state-of-the-art deep learning inpainting models on a scanline inpainting problem for LANDSAT 7 satellite images, assessed on a variety of in- and out-of-distribution images.
    Learning from time-dependent streaming data with online stochastic algorithms. (arXiv:2205.12549v1 [cs.LG])
    We study stochastic algorithms in a streaming framework, trained on samples coming from a dependent data source. In this streaming framework, we analyze the convergence of Stochastic Gradient (SG) methods in a non-asymptotic manner; this includes various SG methods such as the well-known stochastic gradient descent (i.e., the Robbins-Monro algorithm), mini-batch SG methods, and their averaged estimates (i.e., Polyak-Ruppert averaging). Our results form a heuristic that links the level of dependency and convexity to the remaining model parameters. This heuristic provides new insights into choosing the optimal learning rate, which can help increase the stability of SG-based methods; these investigations suggest large streaming batches with slowly decaying learning rates for highly dependent data sources.
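    A sketch of the setup the analysis covers, assuming a least-squares streaming loss: mini-batch SGD with a slowly decaying learning rate and a Polyak-Ruppert running average of the iterates. The decay exponent and the toy dependent stream are illustrative choices.

```python
import numpy as np

def streaming_asgd(batches, dim, lr0=0.5, alpha=0.7):
    """Mini-batch SGD with Polyak-Ruppert averaging over a stream of
    (X, y) least-squares batches. alpha < 1 gives a slowly decaying rate,
    as suggested for highly dependent data sources."""
    theta, theta_bar = np.zeros(dim), np.zeros(dim)
    for t, (X, y) in enumerate(batches, start=1):
        g = X.T @ (X @ theta - y) / len(y)        # mini-batch gradient
        theta -= lr0 * t ** (-alpha) * g
        theta_bar += (theta - theta_bar) / t      # running average of iterates
    return theta_bar

# A toy dependent stream: an AR(1)-driven design with a fixed true parameter.
rng = np.random.default_rng(0)
true = rng.standard_normal(5)
stream, x = [], np.zeros(5)
for _ in range(200):
    rows = []
    for _ in range(32):
        x = 0.8 * x + rng.standard_normal(5)      # temporally dependent rows
        rows.append(x.copy())
    X = np.array(rows)
    stream.append((X, X @ true + 0.1 * rng.standard_normal(32)))
print(np.linalg.norm(streaming_asgd(stream, 5) - true))
```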
    Stochastic Second-Order Methods Provably Beat SGD For Gradient-Dominated Functions. (arXiv:2205.12856v1 [cs.LG])
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying gradient dominance property which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that SCRN improves the best-known sample complexity of stochastic gradient descent in achieving $\epsilon$-global optimum by a factor of $\mathcal{O}(\epsilon^{-1/2})$. Even under a weak version of gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the sample complexity of SCRN can be improved by a factor of ${\mathcal{O}}(\epsilon^{-1/2})$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
    Learning the Travelling Salesperson Problem Requires Rethinking Generalization. (arXiv:2006.07054v6 [cs.LG] UPDATED)
    End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
    Hardness of Maximum Likelihood Learning of DPPs. (arXiv:2205.12377v1 [cs.CC])
    Determinantal Point Processes (DPPs) are a widely used probabilistic model for negatively correlated sets. DPPs have been successfully employed in Machine Learning applications to select a diverse, yet representative subset of data. In seminal work on DPPs in Machine Learning, Kulesza conjectured in his PhD Thesis (2011) that the problem is NP-complete. The lack of a formal proof prompted Brunel, Moitra, Rigollet and Urschel (COLT 2017) to conjecture that, in opposition to Kulesza's conjecture, there exists a polynomial-time algorithm for computing a maximum-likelihood DPP. They also presented some preliminary evidence supporting their conjecture. In this work we prove Kulesza's conjecture. In fact, we prove the following stronger hardness of approximation result: even computing a $\left(1-O(\frac{1}{\log^9{N}})\right)$-approximation to the maximum log-likelihood of a DPP on a ground set of $N$ elements is NP-complete. At the same time, we also obtain the first polynomial-time algorithm that achieves a nontrivial worst-case approximation to the optimal log-likelihood: the approximation factor is $\frac{1}{(1+o(1))\log{m}}$ unconditionally (for data sets that consist of $m$ subsets), and can be improved to $1-\frac{1+o(1)}{\log N}$ if all $N$ elements appear in a $O(1/N)$-fraction of the subsets. In terms of techniques, we reduce approximating the maximum log-likelihood of DPPs on a data set to solving a gap instance of a "vector coloring" problem on a hypergraph. Such a hypergraph is built on a bounded-degree graph construction of Bogdanov, Obata and Trevisan (FOCS 2002), and is further enhanced by the strong expanders of Alon and Capalbo (FOCS 2007) to serve our purposes.
    Action Recognition for American Sign Language. (arXiv:2205.12261v1 [cs.CV])
    In this research, we present our findings on recognizing American Sign Language from series of hand gestures. While most research in the literature focuses only on static handshapes, our work targets dynamic hand gestures. Since datasets of dynamic signs are scarce, we collect an initial dataset of 150 videos for 10 signs and an extension of 225 videos for 15 signs. We apply transfer learning models in combination with deep neural networks and background subtraction for videos in different temporal settings. Our preliminary results show that we obtain accuracies of $0.86$ and $0.71$ using DenseNet201 and LSTM, respectively, on video sequences of 12 frames.
    Learning JPEG Compression Artifacts for Image Manipulation Detection and Localization. (arXiv:2108.12947v2 [eess.IV] UPDATED)
    Detecting and localizing image manipulation are necessary to counter malicious use of image editing techniques. Accordingly, it is essential to distinguish between authentic and tampered regions by analyzing intrinsic statistics in an image. We focus on JPEG compression artifacts left during image acquisition and editing. We propose a convolutional neural network (CNN) that uses discrete cosine transform (DCT) coefficients, where compression artifacts remain, to localize image manipulation. Standard CNNs cannot learn the distribution of DCT coefficients because the convolution throws away the spatial coordinates, which are essential for DCT coefficients. We illustrate how to design and train a neural network that can learn the distribution of DCT coefficients. Furthermore, we introduce Compression Artifact Tracing Network (CAT-Net) that jointly uses image acquisition artifacts and compression artifacts. It significantly outperforms traditional and deep neural network-based methods in detecting and localizing tampered regions.
    Learning to Model Editing Processes. (arXiv:2205.12374v1 [cs.CL])
    Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content: iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, i.e., the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multi-step edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits.
    AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models. (arXiv:2205.12410v1 [cs.CL])
    Fine-tuning large-scale pre-trained language models on downstream tasks requires updating hundreds of millions of parameters. This not only increases the serving cost, since a large copy of the model weights must be stored for every task, but also leads to instability during few-shot task adaptation. Parameter-efficient techniques have been developed that tune small trainable components (e.g., adapters) injected into the large model while keeping most of the model weights frozen. The prevalent mechanism for increasing adapter capacity is to widen the bottleneck dimension, which increases the number of adapter parameters. In this work, we introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost, via two key techniques. (i) We introduce multiple shared adapter components in each layer of the Transformer architecture. We leverage sparse learning via random routing to update the adapter parameters (the encoder is kept frozen), resulting in the same computational cost (FLOPs) as training a single adapter. (ii) We propose a simple merging mechanism that averages the weights of the multiple adapter components to collapse them into a single adapter in each Transformer layer, thereby keeping the overall number of parameters the same while yielding a significant performance improvement. We demonstrate that these techniques work well across multiple task settings, including fully supervised and few-shot Natural Language Understanding tasks. By tuning only 0.23% of a pre-trained language model's parameters, our model outperforms full model fine-tuning and several competing methods.
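    A sketch of the merging mechanism described in (ii), assuming the adapter components share an identical architecture: averaging the weights of parallel adapters collapses them into a single module at no extra inference cost. Routing and training logic are omitted; names are illustrative.

```python
import copy
import torch
import torch.nn as nn

def merge_adapters(adapters):
    """Average the parameters of architecturally identical adapter modules
    into a single adapter (an AdaMix-style weight-collapsing sketch)."""
    merged = copy.deepcopy(adapters[0])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            p.copy_(torch.stack(
                [dict(a.named_parameters())[name] for a in adapters]).mean(0))
    return merged

# Four bottleneck adapters (down-project, nonlinearity, up-project):
make = lambda: nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 768))
single_adapter = merge_adapters([make() for _ in range(4)])
```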
    Over-the-Air Design of GAN Training for mmWave MIMO Channel Estimation. (arXiv:2205.12445v1 [eess.SP])
    Future wireless systems are trending towards higher carrier frequencies that offer larger communication bandwidth but necessitate the use of large antenna arrays. Existing signal processing techniques for channel estimation do not scale well to this "high-dimensional" regime in terms of performance and pilot overhead. Meanwhile, training deep learning based approaches for channel estimation requires large labeled datasets mapping pilot measurements to clean channel realizations, which can only be generated offline using simulated channels. In this paper, we develop a novel unsupervised over-the-air (OTA) algorithm that utilizes noisy received pilot measurements to train a deep generative model to output beamspace MIMO channel realizations. Our approach leverages Generative Adversarial Networks (GAN), while using a conditional input to distinguish between Line-of-Sight (LOS) and Non-Line-of-Sight (NLOS) channel realizations. We also present a federated implementation of the OTA algorithm that distributes the GAN training over multiple users and greatly reduces the user side computation. We then formulate channel estimation from a limited number of pilot measurements as an inverse problem and reconstruct the channel by optimizing the input vector of the trained generative model. Our proposed approach significantly outperforms Orthogonal Matching Pursuit on both LOS and NLOS channel models, and EM-GM-AMP -- an Approximate Message Passing algorithm -- on LOS channel models, while achieving comparable performance on NLOS channel models in terms of the normalized channel reconstruction error. More importantly, our proposed framework has the potential to be trained online using real noisy pilot measurements, is not restricted to a specific channel model and can even be utilized for a federated OTA design of a dataset generator from noisy data.
    VeriFi: Towards Verifiable Federated Unlearning. (arXiv:2205.12709v1 [cs.CR])
    Federated learning (FL) is a collaborative learning paradigm where participants jointly train a powerful model without sharing their private data. One desirable property for FL is the implementation of the right to be forgotten (RTBF), i.e., a leaving participant has the right to request the deletion of its private data from the global model. However, unlearning itself may not be enough to implement RTBF unless the unlearning effect can be independently verified, an important aspect that has been overlooked in the current literature. In this paper, we put forward the concept of verifiable federated unlearning, and propose VeriFi, a unified framework integrating federated unlearning and verification that allows systematic analysis of the unlearning process and quantification of its effect, with different combinations of multiple unlearning and verification methods. In VeriFi, the leaving participant is granted the right to verify (RTV): the participant notifies the server before leaving, then actively verifies the unlearning effect over the next few communication rounds. The unlearning is done at the server side immediately after receiving the leaving notification, while the verification is done locally by the leaving participant via two steps: marking (injecting carefully designed markers to fingerprint the leaver) and checking (examining the change in the global model's performance on the markers). Based on VeriFi, we conduct the first systematic and large-scale study of verifiable federated unlearning, considering 7 unlearning methods and 5 verification methods. In particular, we propose a more efficient and FL-friendly unlearning method, and two more effective and robust non-invasive verification methods. We extensively evaluate VeriFi on 7 datasets and 4 types of deep learning models. Our analysis establishes important empirical understanding for more trustworthy federated unlearning.
    A Kernel Stein Test for Comparing Latent Variable Models. (arXiv:1907.00586v4 [stat.ML] UPDATED)
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.
    MAPLE: Microprocessor A Priori for Latency Estimation. (arXiv:2111.15106v2 [cs.LG] UPDATED)
    Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption. As such, neural architecture search (NAS) algorithms take these two constraints into account when generating a new architecture. However, efficiency metrics such as latency are typically hardware dependent, requiring the NAS algorithm to either measure or predict the architecture latency. Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process. Here we propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation but instead generalizes to new hardware by incorporating prior hardware characteristics during training. MAPLE takes advantage of a novel quantitative strategy to characterize the underlying microprocessor by measuring relevant hardware performance metrics, yielding a fine-grained and expressive hardware descriptor. Moreover, MAPLE benefits from the tightly coupled I/O between the CPU and GPU and their dependency, predicting DNN latency on GPUs while measuring microprocessor performance hardware counters from the CPU feeding the GPU. Through this quantitative strategy as the hardware descriptor, MAPLE can generalize to new hardware via a few-shot adaptation strategy: with as few as 3 samples it exhibits a 6% improvement over state-of-the-art methods requiring as many as 10 samples. Increasing the few-shot adaptation samples to 10 improves accuracy over the state-of-the-art methods by 12%. Furthermore, MAPLE exhibits 8-10% better accuracy, on average, than relevant baselines at any number of adaptation samples.
    When Is Partially Observable Reinforcement Learning Not Scary?. (arXiv:2204.08967v2 [cs.LG] UPDATED)
    Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
    ColdGuess: A General and Effective Relational Graph Convolutional Network to Tackle Cold Start Cases. (arXiv:2205.12318v1 [cs.LG])
    Low-quality listings and bad actor behavior on online retail websites threaten e-commerce businesses, as they result in a sub-optimal buying experience and erode customer trust. When a new listing is created, how can we tell whether it is of good quality, using a method that is effective, fast, and scalable? Previous approaches often face three limitations/challenges: (1) they are unable to handle cold start problems where new sellers/listings lack sufficient selling histories; (2) they cannot score hundreds of millions of listings at scale, or compromise performance for scalability; (3) they face space challenges from the large-scale graphs induced by a giant e-commerce business. To overcome these limitations/challenges, we propose ColdGuess, an inductive graph-based risk predictor built upon a heterogeneous seller-product graph, which effectively identifies risky sellers/products/listings at scale. ColdGuess tackles the large-scale graph via consolidated nodes, and addresses the cold start problem using homogeneous influence. Evaluation on real data demonstrates that ColdGuess has stable performance as the number of unknown features increases. It outperforms LightGBM by up to 34 pcp in ROC-AUC in a cold start case where a new seller sells a new product. The resulting system, ColdGuess, is effective, adaptable to changing risky-seller behavior, and is already in production.
    Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks. (arXiv:2105.05553v7 [cs.LG] UPDATED)
    Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model's parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stopping and its connection to PCA, and why deep networks converge more slowly with random labels.
    Fast & Furious: Modelling Malware Detection as Evolving Data Streams. (arXiv:2205.12311v1 [cs.CR])
    Malware is a major threat to computer systems and imposes many challenges on cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drift) that directly affect ML model detection rates. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (~130K apps) and AndroZoo (~350K apps). Android is a ubiquitous operating system for smartphones, which stimulates attackers to regularly create and update malware for the platform. We conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest to its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches. As a result, we observed that updating every component of the pipeline in response to concept drift allows the classification model to achieve increasing detection rates as the data representation (extracted features) is updated. Furthermore, we discuss the impact of the changes on the classification models by comparing the variations in the extracted features.
    TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models. (arXiv:2205.12372v1 [cs.LG])
    We introduce torchNTK, a Python library to calculate the empirical neural tangent kernel (NTK) of neural network models in the PyTorch framework. We provide an efficient method to calculate the NTK of multilayer perceptrons. We compare the explicit differentiation implementation against autodifferentiation implementations, which have the benefit of extending the utility of the library to any architecture supported by PyTorch, such as convolutional networks. A feature of the library is that we expose the user to layerwise NTK components, and show that in some regimes a layerwise calculation is more memory efficient. We conduct preliminary experiments to demonstrate use cases for the software and probe the NTK.
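    As a generic illustration of the quantity the library computes (not torchNTK's actual interface, which may differ), the empirical NTK of a scalar-output model can be formed from explicit parameter Jacobians:

```python
import torch
import torch.nn as nn

def empirical_ntk(model, x1, x2):
    """K[i, j] = <grad_theta f(x1_i), grad_theta f(x2_j)> via explicit Jacobians.
    Fine for small models; batched autodiff variants scale better."""
    def jacobian(xs):
        rows = []
        for x in xs:
            out = model(x.unsqueeze(0)).squeeze()          # scalar output
            grads = torch.autograd.grad(out, list(model.parameters()))
            rows.append(torch.cat([g.reshape(-1) for g in grads]))
        return torch.stack(rows)                           # (n, n_params)
    return jacobian(x1) @ jacobian(x2).T

mlp = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
K = empirical_ntk(mlp, torch.randn(8, 10), torch.randn(8, 10))  # (8, 8)
```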
    Multi-Agent Low-Dimensional Linear Bandits. (arXiv:2007.01442v4 [cs.LG] UPDATED)
    We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. By distributing the search for the optimal subspace across users and learning of the unknown vector by each agent in the corresponding low-dimensional subspace, we show that the per-agent finite-time regret is much smaller than the case when agents do not communicate. We finally complement these results through simulations.
    EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures. (arXiv:2205.10390v2 [cs.LG] UPDATED)
    Protein complexes are macromolecules essential to the functioning and well-being of all living organisms. As the structure of a protein complex, in particular its region of interaction between multiple protein subunits (i.e., chains), has a notable influence on the biological function of the complex, computational methods that can quickly and effectively be used to refine and assess the quality of a protein complex's 3D structure can directly be used within a drug discovery pipeline to accelerate the development of new therapeutics and improve the efficacy of future vaccines. In this work, we introduce the Equivariant Graph Refiner (EGR), a novel E(3)-equivariant graph neural network (GNN) for multi-task structure refinement and assessment of protein complexes. Our experiments on new, diverse protein complex datasets, all of which we make publicly available in this work, demonstrate the state-of-the-art effectiveness of EGR for atomistic refinement and assessment of protein complexes and outline directions for future work in the field. In doing so, we establish a baseline for future studies in macromolecular refinement and structure analysis.
    K-12BERT: BERT for K-12 education. (arXiv:2205.12335v1 [cs.CL])
    Online education platforms are powered by various NLP pipelines, which utilize models like BERT to aid in content curation. Since the inception of the pre-trained language models like BERT, there have also been many efforts toward adapting these pre-trained models to specific domains. However, there has not been a model specifically adapted for the education domain (particularly K-12) across subjects to the best of our knowledge. In this work, we propose to train a language model on a corpus of data curated by us across multiple subjects from various sources for K-12 education. We also evaluate our model, K12-BERT, on downstream tasks like hierarchical taxonomy tagging.
    A comparative study of non-deep learning, deep learning, and ensemble learning methods for sunspot number prediction. (arXiv:2203.05757v2 [astro-ph.SR] UPDATED)
    Solar activity has significant impacts on human activities and health. One of the most commonly used measures of solar activity is the sunspot number. This paper compares three important non-deep learning models, four popular deep learning models, and their five ensemble models in forecasting sunspot numbers. In particular, we propose an ensemble model called XGBoost-DL, which uses XGBoost as a two-level nonlinear ensemble method to combine the deep learning models. Our XGBoost-DL achieves the best forecasting performance (RMSE = 25.70 and MAE = 19.82) in the comparison, outperforming the best non-deep learning model SARIMA (RMSE = 54.11 and MAE = 45.51), the best deep learning model Informer (RMSE = 29.90 and MAE = 22.35) and NASA's forecast (RMSE = 48.38 and MAE = 38.45). Our XGBoost-DL forecasts a peak sunspot number of 133.47 in May 2025 for Solar Cycle 25 and 164.62 in November 2035 for Solar Cycle 26, similar to but later than NASA's predictions of 137.7 in October 2024 and 161.2 in December 2034. An open-source Python package of our XGBoost-DL for sunspot number prediction is available at https://github.com/yd1008/ts_ensemble_sunspot.
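    A sketch of the two-level stacking the abstract describes, with dummy arrays standing in for the base models' out-of-sample forecasts (in practice these would come from SARIMA, Informer, etc., fitted on earlier data); all data and hyperparameters below are illustrative.

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
y = rng.random(500)                              # placeholder target series
base_forecasts = np.column_stack([               # placeholder level-1 forecasts
    y + 0.1 * rng.standard_normal(500) for _ in range(4)])

# Level-2 model: XGBoost learns a nonlinear combination of the base forecasts.
stacker = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
stacker.fit(base_forecasts[:400], y[:400])       # train on the earlier window
ensemble_pred = stacker.predict(base_forecasts[400:])
```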
    Should You Mask 15% in Masked Language Modeling?. (arXiv:2202.08005v2 [cs.CL] UPDATED)
    Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and even masking 80% can preserve most of the performance, as measured by finetuning on downstream tasks. Increasing the masking rates has two distinct effects, which we investigate through careful ablations: (1) A larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models with more capacity to tackle harder tasks in particular favor higher masking rates. We also find that even more sophisticated masking schemes such as span masking or PMI masking can benefit from higher masking rates, albeit to a smaller extent. Our results contribute to a better understanding of masked language modeling and shed light on more efficient language pre-training.
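    A sketch of BERT-style corruption with the masking rate exposed as a parameter (the quantity the paper varies); the 80/10/10 split among [MASK], random token, and unchanged follows the standard recipe. Token ids and special-token handling below are illustrative.

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, mask_rate=0.4, special_ids=()):
    """Corrupt input_ids for MLM at the given mask_rate; returns (inputs, labels)."""
    labels = input_ids.clone()
    special = torch.tensor(list(special_ids) or [-1])
    selectable = ~torch.isin(input_ids, special)
    selected = (torch.rand(input_ids.shape) < mask_rate) & selectable
    labels[~selected] = -100                        # positions ignored by the loss
    corrupted = input_ids.clone()
    r = torch.rand(input_ids.shape)
    corrupted[selected & (r < 0.8)] = mask_id       # 80%: replace with [MASK]
    rand_pos = selected & (r >= 0.8) & (r < 0.9)    # 10%: replace with random token
    corrupted[rand_pos] = torch.randint(vocab_size, input_ids.shape)[rand_pos]
    return corrupted, labels                        # remaining 10%: left unchanged

ids = torch.randint(5, 30000, (2, 128))
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30000, mask_rate=0.4)
```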
    Federated Self-supervised Learning for Heterogeneous Clients. (arXiv:2205.12493v1 [cs.LG])
    Federated Learning has become an important learning paradigm due to its privacy and computational benefits. As the field advances, two key challenges that still remain to be addressed are: (1) system heterogeneity - variability in the compute and/or data resources present on each client, and (2) lack of labeled data in certain federated settings. Several recent developments have tried to overcome these challenges independently. In this work, we propose a unified and systematic framework, Heterogeneous Self-supervised Federated Learning (Hetero-SSFL), for enabling self-supervised learning with federation on heterogeneous clients. The proposed framework allows collaborative representation learning across all the clients without imposing architectural constraints or requiring the presence of labeled data. The key idea in Hetero-SSFL is to let each client train its unique self-supervised model and enable joint learning across clients by aligning the lower-dimensional representations on a common dataset. The entire training procedure can be viewed as self- and peer-supervised, as both the local training and the alignment procedures do not require the presence of any labeled data. As in conventional self-supervised learning, the obtained client models are task-independent and can be used for varied end-tasks. We provide a convergence guarantee for the proposed framework for non-convex objectives in heterogeneous settings and also empirically demonstrate that our proposed approach outperforms state-of-the-art methods by a significant margin.
    Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models. (arXiv:2205.12694v1 [cs.CL])
    Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models that is compatible with a wide variety of contemporary off-the-shelf hardware (unlike quantization), and that requires little additional training (unlike distillation). Pruning approaches typically take a large, accurate model as input, then attempt to discover a smaller subnetwork of that model capable of achieving end-task accuracy comparable to the full model. Inspired by previous work suggesting a connection between simpler, more generalizable models and those that lie within flat basins in the loss landscape, we propose to directly optimize for flat minima while performing task-specific pruning, which we hypothesize should lead to simpler parameterizations and thus more compressible models. In experiments combining sharpness-aware minimization with both iterative magnitude pruning and structured pruning approaches, we show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization when fine-tuning BERT models, leading to higher rates of compression with little to no loss in accuracy on the GLUE classification benchmark.
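    A minimal sketch of one sharpness-aware minimization step, assuming loss_fn is a closure that recomputes the loss on the current batch and that gradients are zeroed on entry; rho and all names are illustrative.

```python
import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    """Ascend to the worst-case point within an L2 ball of radius rho,
    then apply the base optimizer using the gradient computed there."""
    loss_fn().backward()                            # gradient at the current point
    params = [p for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    with torch.no_grad():
        eps = [rho * p.grad / (norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                               # perturb toward higher loss
    optimizer.zero_grad()
    loss_fn().backward()                            # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                               # undo the perturbation
    optimizer.step()                                # descend with the SAM gradient
    optimizer.zero_grad()
```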
    Misleading Deep-Fake Detection with GAN Fingerprints. (arXiv:2205.12543v1 [cs.CV])
    Generative adversarial networks (GANs) have made remarkable progress in synthesizing realistic-looking images that effectively outsmart even humans. Although several detection methods can recognize these deep fakes by checking for image artifacts from the generation process, multiple counterattacks have demonstrated their limitations. These attacks, however, still require certain conditions to hold, such as interacting with the detection method or adjusting the GAN directly. In this paper, we introduce a novel class of simple counterattacks that overcomes these limitations. In particular, we show that an adversary can remove indicative artifacts, the GAN fingerprint, directly from the frequency spectrum of a generated image. We explore different realizations of this removal, ranging from filtering high frequencies to more nuanced frequency-peak cleansing. We evaluate the performance of our attack with different detection methods, GAN architectures, and datasets. Our results show that an adversary can often remove GAN fingerprints and thus evade the detection of generated images.
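    A sketch of the simplest attack variant described here, assuming a grayscale image as a NumPy array: zero out the outer (high-frequency) ring of the centered 2D spectrum, where upsampling fingerprints tend to concentrate, then invert the transform. The cutoff is an illustrative choice.

```python
import numpy as np

def remove_high_frequencies(img, keep_fraction=0.75):
    """Low-pass the image spectrum: keep a centered disk of relative radius
    keep_fraction and zero everything outside it."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    spectrum[radius > keep_fraction * min(h, w) / 2] = 0   # drop high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

cleaned = remove_high_frequencies(np.random.rand(256, 256))
```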
    Inception Transformer. (arXiv:2205.12956v1 [cs.CV])
    Recent studies show that Transformers have a strong capability for building long-range dependencies, yet are incompetent in capturing the high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency information onto Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel-splitting mechanism that adopts parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play a larger role in capturing high-frequency details while top layers play a larger role in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e. gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which can effectively trade off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S achieves a top-1 accuracy of 83.4% on ImageNet-1K, 3.6% higher than DeiT-S, and even slightly better than the much bigger model Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.
    Physics Guided Machine Learning for Variational Multiscale Reduced Order Modeling. (arXiv:2205.12419v1 [physics.flu-dyn])
    We propose a new physics guided machine learning (PGML) paradigm that leverages the variational multiscale (VMS) framework and available data to dramatically increase the accuracy of reduced order models (ROMs) at a modest computational cost. The hierarchical structure of the ROM basis and the VMS framework enable a natural separation of the resolved and unresolved ROM spatial scales. Modern PGML algorithms are used to construct novel models for the interaction among the resolved and unresolved ROM scales. Specifically, the new framework builds ROM operators that are closest to the true interaction terms in the VMS framework. Finally, machine learning is used to reduce the projection error and further increase the ROM accuracy. Our numerical experiments for a two-dimensional vorticity transport problem show that the novel PGML-VMS-ROM paradigm maintains the low computational cost of current ROMs, while significantly increasing the ROM accuracy.
    Deep Aesthetic Assessment and Retrieval of Breast Cancer Treatment Outcomes. (arXiv:2205.12611v1 [cs.CV])
    Treatments for breast cancer have continued to evolve and improve in recent years, resulting in a substantial increase in survival rates, with approximately 80% of patients having a 10-year survival period. Given the serious impact that breast cancer treatments can have on a patient's body image, consequently affecting her self-confidence and sexual and intimate relationships, it is paramount to ensure that women receive the treatment that optimizes both survival and aesthetic outcomes. Currently, there is no gold standard for evaluating the aesthetic outcome of breast cancer treatment. In addition, there is no standard way to show patients the potential outcome of surgery. The presentation of similar cases from the past would be extremely important to manage women's expectations of the possible outcome. In this work, we propose a deep neural network to perform the aesthetic evaluation. As a proof-of-concept, we focus on a binary aesthetic evaluation. Besides its use for classification, this deep neural network can also be used to find the most similar past cases by searching for nearest neighbours in the highly semantic space before classification. We performed the experiments on a dataset consisting of 143 photos of women after conservative treatment for breast cancer. The results for accuracy and balanced accuracy showed the superior performance of our proposed model compared to the state of the art in aesthetic evaluation of breast cancer treatments. In addition, the model showed a good ability to retrieve similar previous cases, with the retrieved cases having the same or adjacent class (in the 4-class setting) and having similar types of asymmetry. Finally, a qualitative interpretability assessment was also performed to analyse the robustness and trustworthiness of the model.
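    The retrieval component amounts to nearest-neighbour search in the embedding space feeding the classifier; a minimal sketch (numpy; embeddings are assumed precomputed from the network's penultimate layer):

        import numpy as np

        def retrieve_similar_cases(query_emb, gallery_embs, k=3):
            # Rank past cases by Euclidean distance in the semantic space.
            dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
            return np.argsort(dists)[:k]     # indices of the k nearest cases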
    Low-rank Optimal Transport: Approximation, Statistics and Debiasing. (arXiv:2205.12365v1 [stat.ML])
    The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated by Scetbon et al. (2021) holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical bounds, relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.
    Ultra-compact Binary Neural Networks for Human Activity Recognition on RISC-V Processors. (arXiv:2205.12781v1 [cs.LG])
    Human Activity Recognition (HAR) is a relevant inference task in many mobile applications. State-of-the-art HAR at the edge is typically achieved with lightweight machine learning models such as decision trees and Random Forests (RFs), whereas deep learning is less common due to its high computational complexity. In this work, we propose a novel implementation of HAR based on deep neural networks, specifically on Binary Neural Networks (BNNs), targeting low-power general-purpose processors with a RISC-V instruction set. BNNs yield very small memory footprints and low inference complexity, thanks to the replacement of arithmetic operations with bit-wise ones. However, existing BNN implementations on general-purpose processors impose constraints tailored to complex computer vision tasks, which result in over-parametrized models for simpler problems like HAR. Therefore, we also introduce a new BNN inference library, which explicitly targets ultra-compact models. With experiments on a single-core RISC-V processor, we show that BNNs trained on two HAR datasets obtain higher classification accuracy compared to a state-of-the-art baseline based on RFs. Furthermore, our BNN achieves the same accuracy as an RF with either up to 91% less memory or up to 70% higher energy efficiency, depending on the complexity of the features extracted by the RF.
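    The core BNN trick of replacing arithmetic with bit-wise operations can be shown in a few lines (a conceptual Python sketch; the paper's library realizes this with RISC-V integer instructions):

        def binary_dot(a_bits, b_bits, n_bits):
            # Inner product of two {-1, +1} vectors packed as bits ({1, 0}):
            # XNOR counts agreements, popcount sums them.
            matches = bin(~(a_bits ^ b_bits) & ((1 << n_bits) - 1)).count("1")
            return 2 * matches - n_bits      # agreements minus disagreements

    For example, binary_dot(0b1011, 0b1001, 4) returns 2, the dot product of the corresponding (+1, -1, +1, +1) and (+1, -1, -1, +1) vectors.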
    VAEL: Bridging Variational Autoencoders and Probabilistic Logic Programming. (arXiv:2202.04178v2 [cs.PL] UPDATED)
    We present VAEL, a neuro-symbolic generative model integrating variational autoencoders (VAE) with the reasoning capabilities of probabilistic logic (L) programming. Besides standard latent subsymbolic variables, our model exploits a probabilistic logic program to define a further structured representation, which is used for logical reasoning. The entire process is end-to-end differentiable. Once trained, VAEL can solve new unseen generation tasks by (i) leveraging the previously acquired knowledge encoded in the neural component and (ii) exploiting new logical programs on the structured latent space. Our experiments provide support for the benefits of this neuro-symbolic integration both in terms of task generalization and data efficiency. To the best of our knowledge, this work is the first to propose a general-purpose end-to-end framework integrating probabilistic logic programming into a deep generative model.
    Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors. (arXiv:2205.12569v1 [cs.CR])
    As in other cybersecurity areas, machine learning (ML) techniques have emerged as a promising solution to detect Android malware. In this sense, many proposals employing a variety of algorithms and feature sets have been presented to date, often reporting impressive detection performances. However, the lack of reproducibility and the absence of a standard evaluation framework make these proposals difficult to compare. In this paper, we perform an analysis of 10 influential research works on Android malware detection using a common evaluation framework. We have identified five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models and their performances. In particular, we analyze the effect of (1) the presence of duplicated samples, (2) label (goodware/greyware/malware) attribution, (3) class imbalance, (4) the presence of apps that use evasion techniques, and (5) the evolution of apps. Based on this extensive experimentation, we conclude that the studied ML-based detectors have been evaluated optimistically, which explains the good published results. Our findings also highlight that it is imperative to generate realistic datasets, taking into account the factors mentioned above, to enable the design and evaluation of better solutions for Android malware detection.
    Black box tests for algorithmic stability. (arXiv:2111.15546v2 [cs.LG] UPDATED)
    Algorithmic stability is a concept from learning theory that expresses the degree to which changes to the input data (e.g., removal of a single data point) may affect the outputs of a regression algorithm. Knowing an algorithm's stability properties is often useful for many downstream applications -- for example, stability is known to lead to desirable generalization properties and predictive inference guarantees. However, many modern algorithms currently used in practice are too complex for a theoretical analysis of their stability properties, and thus we can only attempt to establish these properties through an empirical exploration of the algorithm's behavior on various data sets. In this work, we lay out a formal statistical framework for this kind of "black box testing" without any assumptions on the algorithm or the data distribution, and establish fundamental bounds on the ability of any black box test to identify algorithmic stability.
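    In the spirit of such black-box tests, stability can be probed empirically by refitting on perturbed data; a toy sketch (fit_predict is a hypothetical black box that trains on (X, y) and returns a scalar prediction at x_test):

        import numpy as np

        def empirical_stability(fit_predict, X, y, x_test, trials=20, rng=None):
            # Black-box probe: refit with one random training point removed
            # and record how much the prediction at x_test moves.
            rng = rng or np.random.default_rng(0)
            base = fit_predict(X, y, x_test)
            deltas = []
            for _ in range(trials):
                i = rng.integers(len(X))
                keep = np.arange(len(X)) != i
                deltas.append(abs(fit_predict(X[keep], y[keep], x_test) - base))
            return np.mean(deltas), np.max(deltas)

    The paper's point is precisely that such finite experiments can only certify stability up to fundamental limits, which it quantifies.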
    Deletion and Insertion Tests in Regression Models. (arXiv:2205.12423v1 [cs.LG])
    A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of Petsiuk et al. (2018) are used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems, we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS). Exact computation of KS grows exponentially with dimension, while that of IG grows linearly with dimension. In two data sets including binary variables we find that KS is superior to IG in insertion and deletion tests, but only by a very small amount. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels. We show that IG will match KS when $f$ is an additive function plus a multilinear function of the variables. This includes a multilinear interpolation over the binary variables that would cause IG to have exponential cost in a naive implementation.
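    For intuition, a deletion test in the regression setting can be sketched as follows (f is a hypothetical black-box model over a 1-D numpy feature vector; the baseline value and the normalization are illustrative choices):

        import numpy as np

        def deletion_auc(f, x, ranking, baseline=0.0):
            # Deletion test: replace features with a baseline value in order
            # of decreasing importance, trace the model output, return AUC.
            x = x.copy()
            curve = [f(x)]
            for j in ranking:        # most- to least-important feature indices
                x[j] = baseline
                curve.append(f(x))
            return np.trapz(curve) / (len(curve) - 1)

    A good importance ranking makes this curve drop quickly, giving a small AUC; the insertion test is the mirror image, starting from the baseline and restoring features.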
    VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection. (arXiv:2205.12424v1 [cs.CR])
    This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
    Generating Natural Language Proofs with Verifier-Guided Search. (arXiv:2205.12443v1 [cs.CL])
    Deductive reasoning (drawing conclusions from assumptions) is a challenging problem in NLP. In this work, we focus on proof generation: given a hypothesis and a set of supporting facts in natural language, the model generates a proof tree indicating how to deduce the hypothesis from supporting facts. Instead of generating the entire proof in one shot, prior work has demonstrated the promise of stepwise generation but achieved limited success on real-world data. Existing stepwise methods struggle to generate proof steps that are both valid and relevant. In this paper, we present a novel stepwise method NLProofS (Natural Language Proof Search), which learns to generate relevant steps conditioning on the hypothesis. At the core of our approach, we train an independent verifier to check the validity of proof steps. Instead of generating steps greedily, we search for proofs maximizing a global proof score judged by the verifier. NLProofS achieves state-of-the-art performance on EntailmentBank and RuleTaker. For example, it improves the percentage of correctly predicted proofs from 20.9% to 33.3% in the distractor setting of EntailmentBank. This is the first time stepwise methods have led to better generation of challenging human-authored proofs.
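    Schematically, verifier-guided search scores a candidate proof by its weakest step and expands the best partial proofs; the sketch below assumes hypothetical generate_steps and verifier interfaces and represents each step by its conclusion sentence (NLProofS's actual search and scoring differ in detail):

        import heapq

        def nlproofs_sketch(generate_steps, verifier, hypothesis, facts,
                            beam=5, depth=4):
            # generate_steps(context, hypothesis): candidate conclusions
            # derivable from the context (hypothetical interface).
            # verifier(conclusion, context): validity score in [0, 1].
            # A proof's global score is its weakest step.
            frontier = [(1.0, list(facts), [])]   # (score, context, proof)
            for _ in range(depth):
                candidates = []
                for score, ctx, proof in frontier:
                    for concl in generate_steps(ctx, hypothesis):
                        s = min(score, verifier(concl, ctx))
                        candidates.append((s, ctx + [concl], proof + [concl]))
                if not candidates:
                    break
                frontier = heapq.nlargest(beam, candidates, key=lambda t: t[0])
            return max(frontier, key=lambda t: t[0])[2]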
    Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech. (arXiv:2011.01174v5 [eess.AS] UPDATED)
    Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied regardless of the TTS model architecture or the cause of speech quality degradation, and it is efficient in that it increases neither inference time nor model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves upon previous models in terms of both naturalness and intelligibility.
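    The perceptual loss itself is simple; a sketch (PyTorch; mos_predictor stands for the frozen pre-trained MOS model, and max_mos = 5.0 assumes the usual 1-to-5 MOS scale):

        import torch

        def mos_perceptual_loss(mos_predictor, synth_speech, max_mos=5.0):
            # Push the predicted MOS of synthesized speech toward the best
            # possible score; gradients flow into the TTS model only.
            pred = mos_predictor(synth_speech)   # frozen MOS predictor
            return torch.mean((max_mos - pred) ** 2)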
    Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks. (arXiv:2204.02892v2 [cs.CL] UPDATED)
    The field of Natural Language Processing has experienced a dramatic leap in capabilities with the recent introduction of huge Language Models. Despite this success, natural language problems that involve several compounded steps are still practically unlearnable, even by the largest LMs. This is consistent with the experimental failures of end-to-end learning of composite problems that were demonstrated in a variety of domains. An effective mitigation is to introduce intermediate supervision for solving sub-tasks of the compounded problem. Recently, several works have demonstrated high gains by taking a straightforward approach for incorporating intermediate supervision in compounded natural language problems: the sequence-to-sequence LM is fed with an augmented input, in which the decomposed tasks' labels are simply concatenated to the original input. In this paper, we prove a positive learning result that motivates these recent efforts. We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which, on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results. Beyond motivating contemporary empirical efforts for incorporating intermediate supervision in sequence-to-sequence language models, our positive theoretical result is the first of its kind in the landscape of results on the benefits of intermediate supervision for neural-network learning: until now, all theoretical results on the subject have been negative, i.e., they show cases where learning is impossible without intermediate supervision, while our result is positive, showing that learning is facilitated in the presence of intermediate supervision.
    Beyond Impossibility: Balancing Sufficiency, Separation and Accuracy. (arXiv:2205.12327v1 [cs.LG])
    Among the various aspects of algorithmic fairness studied in recent years, the tension between satisfying both sufficiency and separation -- e.g. the ratios of positive or negative predictive values, and false positive or false negative rates across groups -- has received much attention. Following a debate sparked by COMPAS, a criminal justice predictive system, the academic community has responded by laying out important theoretical understanding, showing that one cannot achieve both with an imperfect predictor when there is no equal distribution of labels across the groups. In this paper, we shed more light on what might be still possible beyond the impossibility -- the existence of a trade-off means we should aim to find a good balance within it. After refining the existing theoretical result, we propose an objective that aims to balance sufficiency and separation measures, while maintaining similar accuracy levels. We show the use of such an objective in two empirical case studies, one involving a multi-objective framework, and the other fine-tuning of a model pre-trained for accuracy. We show promising results, where better trade-offs are achieved compared to existing alternatives.
    sat2pc: Estimating Point Cloud of Building Roofs from 2D Satellite Images. (arXiv:2205.12464v1 [cs.CV])
    Three-dimensional (3D) urban models have gained interest because of their applications in many use-cases such as urban planning and virtual reality. However, generating these 3D representations requires LiDAR data, which are not always readily available. Thus, the applicability of automated 3D model generation algorithms is limited to a few locations. In this paper, we propose sat2pc, a deep learning architecture that predicts the point cloud of a building roof from a single 2D satellite image. Our architecture combines Chamfer distance and EMD loss, resulting in better 2D to 3D performance. We extensively evaluate our model and perform ablation studies on a building roof dataset. Our results show that sat2pc was able to outperform existing baselines by at least 18.6%. Further, we show that the predicted point cloud captures more detail and geometric characteristics than other baselines.
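    Of the two point-cloud losses combined here, Chamfer distance is the simpler; a minimal PyTorch sketch (EMD additionally requires an optimal point matching and is usually approximated):

        import torch

        def chamfer_distance(p, q):
            # p: (N, 3), q: (M, 3) point clouds.
            # Symmetric average nearest-neighbour distance.
            d = torch.cdist(p, q)               # (N, M) pairwise distances
            return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()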
    NECA: Network-Embedded Deep Representation Learning for Categorical Data. (arXiv:2205.12752v1 [cs.LG])
    We propose NECA, a deep representation learning method for categorical data. Built upon the foundations of network embedding and deep unsupervised representation learning, NECA deeply embeds the intrinsic relationship among attribute values and explicitly expresses data objects with numeric vector representations. Designed specifically for categorical data, NECA can support important downstream data mining tasks, such as clustering. Extensive experimental analysis demonstrated the effectiveness of NECA.
    Toward Discovering Options that Achieve Faster Planning. (arXiv:2205.12515v1 [cs.LG])
    We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. For a given set of episodic tasks and a given number of options, the objective prefers options that can be used to achieve a high return by composing few options. By composing few options, fast planning can be achieved. When faced with new tasks similar to the given ones, the discovered options are also expected to accelerate planning. Our objective extends the objective proposed by Harb et al. (2018) for the single-task setting to the multi-task setting. A closer look at Harb et al.'s objective shows that the best options discovered given one task are not likely to be useful for future unseen tasks and that the multi-task setting is indeed necessary for this purpose. In the same paper, Harb et al. also proposed an algorithm to optimize their objective, and the algorithm can be naturally extended to the multi-task setting. We empirically show that in the four-room domain the extension does not achieve a high objective value and propose a new algorithm that better optimizes the proposed objective. In the same four-room domain, we show that 1) a higher objective value is typically associated with options with which fewer planning iterations are needed to achieve near-optimal performance, 2) our new algorithm achieves a high objective value, which is close to the value achieved by a set of human-designed options, 3) the best number of planning iterations given the discovered options is much smaller and matches that obtained with human-designed options, and 4) the options produced by our algorithm also make intuitive sense because they move to and terminate at cells near hallways connecting two neighboring rooms.
    Mathematical Models of Human Drivers Using Artificial Risk Fields. (arXiv:2205.12722v1 [cs.LG])
    In this paper, we use the concept of artificial risk fields to predict how human operators control a vehicle in response to upcoming road situations. A risk field assigns a non-negative risk measure to the state of the system in order to model how close that state is to violating a safety property, such as hitting an obstacle or exiting the road. Using risk fields, we construct a stochastic model of the operator that maps from states to likely actions. We demonstrate our approach on a driving task wherein human subjects are asked to drive a car inside a realistic driving simulator while avoiding obstacles placed on the road. We show that the most likely risk field given the driving data is obtained by solving a convex optimization problem. Next, we apply the inferred risk fields to generate distinct driving behaviors while comparing predicted trajectories against ground truth measurements. We observe that the risk fields are excellent at predicting future trajectory distributions, with high prediction accuracy for prediction horizons of up to twenty seconds. At the same time, we observe some challenges such as the inability to account for how drivers choose to accelerate/decelerate based on the road conditions.
    Image Colorization using U-Net with Skip Connections and Fusion Layer on Landscape Images. (arXiv:2205.12867v1 [cs.CV])
    We present a novel technique to automatically colorize grayscale images that combines the U-Net model and Fusion Layer features. This approach allows the model to learn the colorization of images from a pre-trained U-Net. Moreover, the Fusion layer is applied to merge local features, which depend on small image patches, with global priors of the entire image for each class, forming visually more compelling colorization results. Finally, we validate our approach with a user study evaluation and compare it against the state of the art, showing improvements.
    Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization. (arXiv:2205.12648v1 [cs.LG])
    We tackle real-world problems with complex structures beyond the pixel-based game or simulator. We formulate this as a few-shot reinforcement learning problem in which a task is characterized by a subtask graph that defines a set of subtasks and their dependencies, both unknown to the agent. Different from previous meta-RL methods that try to directly infer an unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks, and uses it as a prior to improve task inference at test time. Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents.
    Mitigating multiple descents: A model-agnostic framework for risk monotonization. (arXiv:2205.12937v1 [math.ST])
    Recent empirical and theoretical analyses of several commonly used prediction procedures reveal a peculiar risk behavior in high dimensions, referred to as double/multiple descent, in which the asymptotic risk is a non-monotonic function of the limiting aspect ratio of the number of features or parameters to the sample size. To mitigate this undesirable behavior, we develop a general framework for risk monotonization based on cross-validation that takes as input a generic prediction procedure and returns a modified procedure whose out-of-sample prediction risk is, asymptotically, monotonic in the limiting aspect ratio. As part of our framework, we propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting, respectively, and show that, under very mild assumptions, they provably achieve monotonic asymptotic risk behavior. Our results are applicable to a broad variety of prediction procedures and loss functions, and do not require a well-specified (parametric) model. We exemplify our framework with concrete analyses of the minimum $\ell_2$, $\ell_1$-norm least squares prediction procedures. As one of the ingredients in our analysis, we also derive novel additive and multiplicative forms of oracle risk inequalities for split cross-validation that are of independent interest.
    Lyapunov function approach for approximation algorithm design and analysis: with applications in submodular maximization. (arXiv:2205.12442v1 [math.OC])
    We propose a two-phase systematic framework for approximation algorithm design and analysis via Lyapunov functions. The first phase consists of using a Lyapunov function as a guideline to design a continuous-time algorithm with a provable approximation ratio. The second phase then converts the continuous-time algorithm to a discrete-time algorithm with the same approximation ratio and a provable time complexity. Some immediate benefits of the Lyapunov function approach include: (i) unifying many existing algorithms; (ii) providing a guideline to design and analyze new algorithms; and (iii) offering new perspectives to potentially improve existing algorithms. We use various submodular maximization problems as running examples to illustrate our framework.
    Conformal Prediction Intervals with Temporal Dependence. (arXiv:2205.12940v1 [stat.ML])
    Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time-series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure which is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time-series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
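    For reference, the standard split conformal interval that such post-hoc procedures build on looks as follows (numpy sketch; CPTD itself additionally adjusts scores for temporal dependence):

        import numpy as np

        def split_conformal_pi(preds_cal, y_cal, preds_test, alpha=0.1):
            # Half-width = conformal quantile of absolute calibration
            # residuals; gives (1 - alpha) marginal coverage.
            scores = np.sort(np.abs(y_cal - preds_cal))
            n = len(scores)
            k = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # conformal rank
            q = scores[k - 1]
            return preds_test - q, preds_test + q

    Any predictive model can supply preds_cal and preds_test, which is what makes such wrappers light-weight.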
    Imposing Gaussian Pre-Activations in a Neural Network. (arXiv:2205.12379v1 [cs.LG])
    The goal of the present work is to propose a way to modify both the initialization distribution of the weights of a neural network and its activation function, such that all pre-activations are Gaussian. We propose a family of pairs initialization/activation, where the activation functions span a continuum from bounded functions (such as Heaviside or tanh) to the identity function. This work is motivated by the contradiction between existing works dealing with Gaussian pre-activations: on one side, the works in the line of the Neural Tangent Kernels and the Edge of Chaos are assuming it, while on the other side, theoretical and experimental results challenge this hypothesis. The family of pairs initialization/activation we are proposing will help us to answer this hot question: is it desirable to have Gaussian pre-activations in a neural network?
    xFraud: Explainable Fraud Transaction Detection. (arXiv:2011.12193v3 [cs.LG] UPDATED)
    At online retail platforms, it is crucial to actively detect the risks of transactions to improve customer experience and minimize financial loss. In this work, we propose xFraud, an explainable fraud transaction prediction framework which is mainly composed of a detector and an explainer. The xFraud detector can effectively and efficiently predict the legitimacy of incoming transactions. Specifically, it utilizes a heterogeneous graph neural network to learn expressive representations from the informative heterogeneously typed entities in the transaction logs. The explainer in xFraud can generate meaningful and human-understandable explanations from graphs to facilitate further processes in the business unit. In our experiments with xFraud on real transaction networks with up to 1.1 billion nodes and 3.7 billion edges, xFraud is able to outperform various baseline models in many evaluation metrics while remaining scalable in distributed settings. In addition, we show that xFraud explainer can generate reasonable explanations to significantly assist the business analysis via both quantitative and qualitative evaluations.
    Bayesian Physics-Informed Neural Networks for real-world nonlinear dynamical systems. (arXiv:2205.08304v2 [cs.LG] UPDATED)
    Understanding real-world dynamical phenomena remains a challenging task. Across various scientific disciplines, machine learning has advanced as the go-to technology to analyze nonlinear dynamical systems, identify patterns in big data, and make decisions based on them. Neural networks are now consistently used as universal function approximators for data with underlying mechanisms that are incompletely understood or exceedingly complex. However, neural networks alone ignore the fundamental laws of physics and often fail to make plausible predictions. Here we integrate data, physics, and uncertainties by combining neural networks, physics-informed modeling, and Bayesian inference to improve the predictive potential of traditional neural network models. We embed the physical model of a damped harmonic oscillator into a fully-connected feed-forward neural network to explore a simple and illustrative model system, the outbreak dynamics of COVID-19. Our Physics-Informed Neural Networks can seamlessly integrate data and physics, robustly solve forward and inverse problems, and perform well for both interpolation and extrapolation, even for a small amount of noisy and incomplete data. At only minor additional cost, they can self-adaptively learn the weighting between data and physics. Combined with Bayesian Neural Networks, they can serve as priors in a Bayesian Inference, and provide credible intervals for uncertainty quantification. Our study reveals the inherent advantages and disadvantages of Neural Networks, Bayesian Inference, and a combination of both and provides valuable guidelines for model selection. While we have only demonstrated these approaches for the simple model problem of a seasonal endemic infectious disease, we anticipate that the underlying concepts and trends generalize to more complex disease conditions and, more broadly, to a wide variety of nonlinear dynamical systems.
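    The physics-informed part of such a model reduces to adding the ODE residual to the data loss; a sketch for a damped harmonic oscillator u'' + 2*delta*u' + omega^2*u = 0 (PyTorch; net, delta, and omega are placeholders, and the Bayesian treatment is omitted):

        import torch

        def pinn_loss(net, t_data, u_data, t_phys, delta=0.1, omega=2.0):
            # Data misfit plus the residual of u'' + 2*delta*u' + omega^2*u = 0
            # evaluated at collocation points t_phys.
            mse_data = torch.mean((net(t_data) - u_data) ** 2)
            t = t_phys.clone().requires_grad_(True)
            u = net(t)
            du = torch.autograd.grad(u, t, torch.ones_like(u),
                                     create_graph=True)[0]
            ddu = torch.autograd.grad(du, t, torch.ones_like(du),
                                      create_graph=True)[0]
            residual = ddu + 2 * delta * du + omega ** 2 * u
            return mse_data + torch.mean(residual ** 2)

    In the inverse-problem setting, delta and omega would themselves be trainable parameters; the paper additionally learns the relative weighting of the two terms.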
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v2 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other.
    Learning dynamics from partial observations with structured neural ODEs. (arXiv:2205.12550v1 [eess.SY])
    Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. We propose a flexible framework to incorporate a broad spectrum of physical insight into neural ODE-based system identification, giving physical interpretability to the resulting latent space. This insight is either enforced through hard constraints in the optimization problem or added in its cost function. In order to link the partial and possibly noisy observations to the latent state, we rely on tools from nonlinear observer theory to build a recognition model. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.
    Heterogeneous Reservoir Computing Models for Persian Speech Recognition. (arXiv:2205.12594v1 [cs.SD])
    Over the last decade, deep-learning methods have been gradually incorporated into conventional automatic speech recognition (ASR) frameworks to create acoustic, pronunciation, and language models. Although this led to significant improvements in ASRs' recognition accuracy, due to their hard constraints related to hardware requirements (e.g., computing power and memory usage), it is unclear if such approaches are the most computationally- and energy-efficient options for embedded ASR applications. Reservoir computing (RC) models (e.g., echo state networks (ESNs) and liquid state machines (LSMs)), on the other hand, have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies. However, their performance in speech processing tasks is relatively inferior to that of the deep-learning-based models. To enhance the accuracy of RC in ASR applications, we propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales. To test our models, we performed a speech recognition task on the Farsdat Persian dataset. Since, to the best of our knowledge, standard RC has not yet been employed to conduct any Persian ASR tasks, we also trained conventional single-layer and deep ESNs to provide baselines for comparison. In addition, we compared the RC performance with a standard long short-term memory (LSTM) model. Heterogeneous RC models (1) show improved performance over the standard RC models; (2) perform on par with the LSTM in terms of recognition accuracy, and (3) reduce the training time considerably.
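    A single-layer ESN, the building block being made heterogeneous here, fits in a few lines (numpy sketch; reservoir size, spectral radius rho, and the ridge coefficient are illustrative):

        import numpy as np

        class ESNSketch:
            # Echo state network: fixed random reservoir, trained readout.
            def __init__(self, n_in, n_res=500, rho=0.9, seed=0):
                rng = np.random.default_rng(seed)
                self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
                W = rng.uniform(-0.5, 0.5, (n_res, n_res))
                # Rescale to the target spectral radius for the echo state
                # property.
                self.W = W * (rho / np.max(np.abs(np.linalg.eigvals(W))))
                self.W_out = None

            def states(self, U):                 # U: (T, n_in)
                x, out = np.zeros(self.W.shape[0]), []
                for u in U:
                    x = np.tanh(self.W_in @ u + self.W @ x)
                    out.append(x.copy())
                return np.array(out)

            def fit(self, U, Y, ridge=1e-6):     # ridge-regression readout
                X = self.states(U)
                self.W_out = np.linalg.solve(
                    X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

            def predict(self, U):
                return self.states(U) @ self.W_out

    Heterogeneity in the paper's sense amounts to mixing reservoirs with different timescales and parameters, so that layers capture temporal context at different scales.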
    FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. (arXiv:2205.12446v1 [cs.CL])
    We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
    MEKER: Memory Efficient Knowledge Embedding Representation for Link Prediction and Question Answering. (arXiv:2204.10629v2 [cs.CL] UPDATED)
    Knowledge Graphs (KGs) are symbolically structured storages of facts. The KG embedding contains concise data used in NLP tasks requiring implicit information about the real world. Furthermore, the size of KGs that may be useful in actual NLP assignments is enormous, and creating embeddings over them raises memory cost issues. We represent a KG as a 3rd-order binary tensor and move beyond the standard CP decomposition by using a data-specific generalized version of it. The generalization of the standard CP-ALS algorithm allows obtaining optimization gradients without a backpropagation mechanism. It reduces the memory needed in training while providing computational benefits. We propose MEKER, a memory-efficient KG embedding model, which yields SOTA-comparable performance on link prediction tasks and KG-based Question Answering.
    Learn2Agree: Fitting with Multiple Annotators without Objective Ground Truth. (arXiv:2109.03596v2 [cs.LG] UPDATED)
    The annotation of domain experts is important for some medical applications where the objective ground truth is ambiguous to define, e.g., the rehabilitation for some chronic diseases, and the prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper uses of the annotations may hinder the development of reliable models. On one hand, forcing the use of a single ground truth generated from multiple annotations is less informative for the modeling. On the other hand, feeding the model with all the annotations without proper regularization is noisy given existing disagreements. To address these issues, we propose a novel Learning to Agreement (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without objective ground truth. The framework has two streams, with one stream fitting with the multiple annotators and the other stream learning agreement information between annotators. In particular, the agreement learning stream produces regularization information to the classifier stream, tuning its decision to be better in line with the agreement between annotators. The proposed method can be easily added to existing backbones, with experiments on two medical datasets showing better agreement levels with annotators.
    Uncertainty Quantification for Transport in Porous media using Parameterized Physics Informed neural Networks. (arXiv:2205.12730v1 [cs.CE])
    We present a Parametrization of the Physics Informed Neural Network (P-PINN) approach to tackle the problem of uncertainty quantification in reservoir engineering problems. We demonstrate the approach with the immiscible two-phase flow displacement (Buckley-Leverett problem) in a heterogeneous porous medium. The reservoir properties (porosity, permeability) are treated as random variables. The distribution of these properties can affect dynamic properties such as fluid saturation, front propagation speed, or breakthrough time. We explore and use to our advantage the ability of networks to interpolate complex high dimensional functions. We observe that the additional dimensions resulting from a stochastic treatment of the partial differential equations tend to produce smoother solutions on quantities of interest (distribution parameters), which is shown to improve the performance of PINNs. We show that, provided a proper parameterization of the uncertainty space, PINN can produce solutions that closely match both the ensemble realizations and the stochastic moments. We demonstrate applications for both homogeneous and heterogeneous fields of properties. We are able to solve problems that can be challenging for classical methods. This approach gives rise to trained models that are both more robust to variations in the input space and can compete in performance with traditional stochastic sampling methods.
    An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation. (arXiv:2205.12770v1 [cs.LG])
    Gradient descent or its variants are popular in training neural networks. However, in deep Q-learning with neural network approximation, a type of reinforcement learning, gradient descent (also known as Residual Gradient (RG)) is barely used to solve Bellman residual minimization problem. On the contrary, Temporal Difference (TD), an incomplete gradient descent method prevails. In this work, we perform extensive experiments to show that TD outperforms RG, that is, when the training leads to a small Bellman residual error, the solution found by TD has a better policy and is more robust against the perturbation of neural network parameters. We further use experiments to reveal a key difference between reinforcement learning and supervised learning, that is, a small Bellman residual error can correspond to a bad policy in reinforcement learning while the test loss function in supervised learning is a standard index to indicate the performance. We also empirically examine that the missing term in TD is a key reason why RG performs badly. Our work shows that the performance of a deep Q-learning solution is closely related to the training dynamics and how an incomplete gradient descent method can find a good policy is interesting for future study.
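    The TD-versus-RG distinction is exactly one detach on the bootstrap target; a PyTorch sketch for Q-learning (q_net maps a batch of states to per-action values, a is a batch of action indices):

        import torch

        def bellman_targets(q_net, r, s2, gamma=0.99):
            return r + gamma * q_net(s2).max(dim=1).values

        def td_loss(q_net, s, a, r, s2, gamma=0.99):
            # TD ("semi-gradient"): the bootstrap target is detached, so no
            # gradient flows through Q(s', .).
            target = bellman_targets(q_net, r, s2, gamma).detach()
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            return torch.mean((q - target) ** 2)

        def rg_loss(q_net, s, a, r, s2, gamma=0.99):
            # Residual Gradient: the true gradient of the Bellman residual,
            # i.e. the target is NOT detached.
            target = bellman_targets(q_net, r, s2, gamma)
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            return torch.mean((q - target) ** 2)

    The "missing term" the paper studies is precisely the gradient through the target that td_loss drops and rg_loss keeps.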
    FBNETGEN: Task-aware GNN-based fMRI Analysis via Functional Brain Network Generation. (arXiv:2205.12465v1 [cs.LG])
    Functional magnetic resonance imaging (fMRI) is one of the most common imaging modalities to investigate brain functions. Recent studies in neuroscience stress the great potential of functional brain networks constructed from fMRI data for clinical predictions. Traditional functional brain networks, however, are noisy and unaware of downstream prediction tasks, while also incompatible with the deep graph neural network (GNN) models. In order to fully unleash the power of GNNs in network-based fMRI analysis, we develop FBNETGEN, a task-aware and interpretable fMRI analysis framework via deep brain network generation. In particular, we formulate (1) prominent region of interest (ROI) features extraction, (2) brain networks generation, and (3) clinical predictions with GNNs, in an end-to-end trainable model under the guidance of particular prediction tasks. Along with the process, the key novel component is the graph generator which learns to transform raw time-series features into task-oriented brain networks. Our learnable graphs also provide unique interpretations by highlighting prediction-related brain regions. Comprehensive experiments on two datasets, i.e., the recently released and currently largest publicly available fMRI dataset Adolescent Brain Cognitive Development (ABCD), and the widely-used fMRI dataset PNC, prove the superior effectiveness and interpretability of FBNETGEN. The implementation is available at https://github.com/Wayfear/FBNETGEN.}
    Multi-Head Online Learning for Delayed Feedback Modeling. (arXiv:2205.12406v1 [cs.LG])
    In online advertising, it is highly important to predict the probability and the value of a conversion (e.g., a purchase). It not only impacts user experience by showing relevant ads, but also affects ROI of advertisers and revenue of marketplaces. Unlike clicks, which often occur within minutes after impressions, conversions are expected to happen over a long period of time (e.g., 30 days for online shopping). It creates a challenge, as the true labels are only available after the long delays. Either inaccurate labels (partial conversions) are used, or models are trained on stale data (e.g., from 30 days ago). The problem is more prominent in online learning, which focuses on the live performance on the latest data. In this paper, a novel solution is presented to address this challenge using multi-head modeling. Unlike traditional methods, it directly quantizes conversions into multiple windows, such as day 1, day 2, day 3-7, and day 8-30. A sub-model is trained specifically on conversions within each window. Label freshness is maximally preserved in early models (e.g., day 1 and day 2), while late conversions are accurately utilized in models with longer delays (e.g., day 8-30). It is shown to greatly exceed the performance of known methods in online learning experiments for both conversion rate (CVR) and value per click (VPC) predictions. Lastly, as a general method for delayed feedback modeling, it can be combined with any advanced ML techniques to further improve the performance.
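    The windowing logic can be sketched directly from the description (Python; the window boundaries below are the ones quoted in the text, and the per-window heads themselves can be arbitrary models):

        # Window layout from the text: day 1, day 2, day 3-7, day 8-30.
        WINDOWS = [(0, 1), (1, 2), (2, 7), (7, 30)]   # (lo, hi] days after click

        def window_labels(delay_days, converted):
            # One binary label per head; a conversion only supervises the
            # head whose window contains its delay.
            return [int(converted and lo < delay_days <= hi)
                    for lo, hi in WINDOWS]

        def combined_cvr(head_probs):
            # Windows are disjoint, so the total conversion probability is
            # the sum of the per-head probabilities.
            return sum(head_probs)

    Early heads become trainable almost immediately, while the day 8-30 head simply lags by its window length, which is how the method trades label freshness against label completeness.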
    DPSNN: A Differentially Private Spiking Neural Network. (arXiv:2205.12718v1 [cs.NE])
    Privacy preservation is a key problem for machine learning algorithms. Spiking neural networks (SNNs) play an important role in many domains, such as image classification, object detection, and speech recognition, but studies on the privacy protection of SNNs are urgently needed. This study combines the differential privacy (DP) algorithm with SNNs and proposes the differentially private spiking neural network (DPSNN). DP injects noise into the gradient, and SNNs transmit information in discrete spike trains, so that our differentially private SNN can maintain strong privacy protection while still ensuring high accuracy. We conducted experiments on MNIST, Fashion-MNIST, and the face recognition dataset Extended YaleB. When privacy protection is improved, the accuracy of an artificial neural network (ANN) drops significantly, but our algorithm shows little change in performance. Meanwhile, we analyzed different factors that affect the privacy protection of SNNs. Firstly, the less precise the surrogate gradient is, the better the privacy protection of the SNN. Secondly, Integrate-And-Fire (IF) neurons perform better than Leaky Integrate-And-Fire (LIF) neurons. Thirdly, a large time window contributes more to privacy protection and performance.
    Analytics of Business Time Series Using Machine Learning and Bayesian Inference. (arXiv:2205.12905v1 [cs.LG])
    In this survey, we consider case studies on sales time series forecasting, a deep learning approach for forecasting non-stationary time series using time trend correction, dynamic price and supply optimization using Q-learning, Bitcoin price modeling, the impact of the spread of COVID-19 on the stock market, and the use of social network signals in analytics. The use of machine learning and Bayesian inference in predictive analytics is analyzed.
    A Convergence Theory for Over-parameterized Variational Quantum Eigensolvers. (arXiv:2205.12481v1 [quant-ph])
    The Variational Quantum Eigensolver (VQE) is a promising candidate for quantum applications on near-term Noisy Intermediate-Scale Quantum (NISQ) computers. Despite many empirical studies and recent progress in the theoretical understanding of VQE's optimization landscape, the convergence of VQE optimization is far less understood. We provide the first rigorous analysis of the convergence of VQEs in the over-parameterization regime. By connecting the training dynamics with the Riemannian Gradient Flow on the unit-sphere, we establish a threshold on the sufficient number of parameters for efficient convergence, which depends polynomially on the system dimension and the spectral ratio, a property of the problem Hamiltonian, and could be resilient to gradient noise to some extent. We further illustrate that this overparameterization threshold could be vastly reduced for specific VQE instances by establishing an ansatz-dependent threshold paralleling our main result. We showcase that our ansatz-dependent threshold could serve as a proxy of the trainability of different VQE ansatzes without performing empirical experiments, which hence leads to a principled way of evaluating ansatz design. Finally, we conclude with a comprehensive empirical study that supports our theoretical findings.
    Recipe2Vec: Multi-modal Recipe Representation Learning with Graph Neural Networks. (arXiv:2205.12396v1 [cs.LG])
    Learning effective recipe representations is essential in food studies. Unlike what has been developed for image-based recipe retrieval or learning structural text embeddings, the combined effect of multi-modal information (i.e., recipe images, text, and relation data) receives less attention. In this paper, we formalize the problem of multi-modal recipe representation learning to integrate the visual, textual, and relational information into recipe embeddings. In particular, we first present Large-RG, a new recipe graph data with over half a million nodes, making it the largest recipe graph to date. We then propose Recipe2Vec, a novel graph neural network based recipe embedding model to capture multi-modal information. Additionally, we introduce an adversarial attack strategy to ensure stable learning and improve performance. Finally, we design a joint objective function of node classification and adversarial learning to optimize the model. Extensive experiments demonstrate that Recipe2Vec outperforms state-of-the-art baselines on two classic food study tasks, i.e., cuisine category classification and region prediction. Dataset and codes are available at https://github.com/meettyj/Recipe2Vec.
    Impartial Games: A Challenge for Reinforcement Learning. (arXiv:2205.12787v1 [cs.LG])
    The AlphaZero algorithm and its successor MuZero have revolutionised several competitive strategy games, including chess, Go, and shogi, and video games like Atari, by learning to play these games better than any human and any specialised computer program. Aside from knowing the rules, AlphaZero had no prior knowledge of each game. This dramatically advanced progress on a long-standing AI challenge to create programs that can learn for themselves from first principles. Theoretically, there are well-known limits to the power of deep learning for strategy games like chess, Go, and shogi, as they are known to be NEXPTIME hard. Some papers have argued that the AlphaZero methodology has limitations and is unsuitable for general AI. However, none of these works has suggested any specific limits for any particular game. In this paper, we provide more powerful bottlenecks than previously suggested. We present the first concrete example of a game - namely the (children's) game of nim - and other impartial games that seem to be a stumbling block for AlphaZero and similar reinforcement learning algorithms. We show experimentally that the bottlenecks apply to both the policy and value networks. Since solving nim can be done in linear time using logarithmic space, i.e., it has very low complexity, our experimental results supersede known theoretical limits based on many games' PSPACE (and NEXPTIME) completeness. We show that nim can be learned on small boards, but when the board size increases, AlphaZero-style algorithms rapidly fail to improve. We quantify the difficulties for various setups, parameter settings, and computational resources. Our results might help expand the AlphaZero self-play paradigm by allowing it to use meta-actions during training and/or actual game play, like applying abstract transformations, or reading and writing to an external memory.
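    The low-complexity solution that the networks fail to discover is the classical nim-sum rule; for concreteness (Python, standard Sprague-Grundy theory):

        from functools import reduce

        def nim_winning_move(piles):
            # Nim-sum: XOR of pile sizes. Nonzero means the player to move
            # wins with correct play; zero means no winning move exists.
            x = reduce(lambda a, b: a ^ b, piles)
            if x == 0:
                return None              # losing position
            for i, p in enumerate(piles):
                if p ^ x < p:            # reduce pile i to p XOR x
                    return i, p - (p ^ x)

    For example, nim_winning_move([3, 4, 5]) finds a move because 3 XOR 4 XOR 5 = 2 is nonzero: taking 2 stones from the first pile leaves 1 XOR 4 XOR 5 = 0.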
    The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training. (arXiv:2205.12502v1 [cs.CV])
    Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude that of VisDial (1.2M to 12.9M QA data). For robust training of the generated dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe strong performance gains in the low-data regime (up to 9.35 absolute points on NDCG).
    Investigating Information Inconsistency in Multilingual Open-Domain Question Answering. (arXiv:2205.12456v1 [cs.CL])
    Retrieval based open-domain QA systems use retrieved documents and answer-span selection over retrieved documents to find best-answer candidates. We hypothesize that multilingual Question Answering (QA) systems are prone to information inconsistency when it comes to documents written in different languages, because these documents tend to provide a model with varying information about the same topic. To understand the effects of the biased availability of information and cultural influence, we analyze the behavior of multilingual open-domain question answering models with a focus on retrieval bias. We analyze if different retriever models present different passages given the same question in different languages on TyDi QA and XOR-TyDi QA, two multilingual QA datasets. We speculate that the content differences in documents across languages might reflect cultural divergences and/or social biases.
    Graph-Based Similarity of Neural Network Representations. (arXiv:2111.11165v2 [cs.LG] UPDATED)
    Understanding the black-box representations in Deep Neural Networks (DNN) is an essential problem in deep learning. In this work, we propose Graph-Based Similarity (GBS) to measure the similarity of layer features. Contrary to previous works that compute the similarity directly on the feature maps, GBS measures the correlation based on the graph constructed with hidden layer outputs. By treating each input sample as a node and the corresponding layer output similarity as edges, we construct the graph of DNN representations for each layer. The similarity between graphs of layers identifies the correspondences between representations of models trained in different datasets and initializations. We demonstrate and prove the invariance property of GBS, including invariance to orthogonal transformation and invariance to isotropic scaling, and compare GBS with CKA. GBS shows state-of-the-art performance in reflecting the similarity and provides insights on explaining the adversarial sample behavior on the hidden layer space.
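    The graph construction underlying GBS can be sketched as follows (numpy; cosine similarity and a top-k neighbourhood are assumed here as one plausible instantiation):

        import numpy as np

        def layer_graph(features, k=10):
            # Nodes = input samples; edge weights = cosine similarity of the
            # layer's outputs, keeping each node's top-k neighbours.
            F = features / np.linalg.norm(features, axis=1, keepdims=True)
            S = F @ F.T
            np.fill_diagonal(S, -np.inf)        # exclude self-edges
            A = np.zeros_like(S)
            idx = np.argsort(-S, axis=1)[:, :k]
            for i, nbrs in enumerate(idx):
                A[i, nbrs] = S[i, nbrs]
            return A

    Comparing two layers (or two models) then reduces to comparing the graphs built from the same input samples, which is what yields invariance to orthogonal transformations and isotropic scaling of the features.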
    The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning. (arXiv:2203.06498v7 [cs.LG] UPDATED)
    Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for the integration of statistical approaches to causal inference and predictive modeling. A deeper understanding of what reproducibility concerns in supervised ML research have in common with the replication crisis in experimental science puts the new concerns in perspective, and helps researchers avoid "the worst of both worlds," where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML. We identify themes that re-occur in reform discussions, like overreliance on asymptotic theory and non-credible beliefs about real-world data generating processes. We argue that in both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often impossible to refute due to undisclosed sources of variance in the learning pipeline. In particular, errors being acknowledged in ML expose cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role of human inductive biases in learning and reform.
    Trust-based Consensus in Multi-Agent Reinforcement Learning Systems. (arXiv:2205.12880v1 [cs.MA])
    An often neglected issue in multi-agent reinforcement learning (MARL) is the potential presence of unreliable agents in the environment whose deviations from expected behavior can prevent a system from accomplishing its intended tasks. In particular, consensus is a fundamental underpinning problem of cooperative distributed multi-agent systems. Consensus requires different agents, situated in a decentralized communication network, to reach an agreement out of a set of initial proposals that they put forward. Learning-based agents should adopt a protocol that allows them to reach consensus despite having one or more unreliable agents in the system. This paper investigates the problem of unreliable agents in MARL, considering consensus as a case study. Echoing established results in the distributed systems literature, our experiments show that even a moderate fraction of such agents can greatly impact the ability to reach consensus in a networked environment. We propose Reinforcement Learning-based Trusted Consensus (RLTC), a decentralized trust mechanism, in which agents can independently decide which neighbors to communicate with. We empirically demonstrate that our trust mechanism is able to deal with unreliable agents effectively, as evidenced by higher consensus success rates.
    MGX: Near-Zero Overhead Memory Protection for Data-Intensive Accelerators. (arXiv:2004.09679v2 [cs.CR] UPDATED)
    This paper introduces MGX, a near-zero overhead memory protection scheme for hardware accelerators. MGX minimizes the performance overhead of off-chip memory encryption and integrity verification by exploiting the application-specific properties of the accelerator execution. In particular, accelerators tend to explicitly manage data movement between on-chip and off-chip memories. Therefore, the general memory access pattern of an accelerator can largely be determined for a given application. Exploiting these characteristics, MGX generates version numbers used in memory encryption and integrity verification using on-chip accelerator state rather than storing them in the off-chip memory; it also customizes the granularity of the memory protection to match the granularity used by the accelerator. To demonstrate the efficacy of MGX, we present an in-depth study of MGX for DNN and graph algorithms. Experimental results show that on average, MGX lowers the performance overhead of memory protection from 28% and 33% to 4% and 5% for DNN and graph processing accelerators in a wide range of benchmarks, respectively.
    RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning. (arXiv:2205.12548v1 [cs.CL])
    Prompting has shown impressive success in enabling large pretrained language models (LMs) to perform diverse NLP tasks, especially when only a few downstream examples are available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompts (e.g., embeddings), which fall short of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompts, on the other hand, are difficult to optimize, and are often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals from the large LM environment, we incorporate effective reward stabilization that substantially enhances the training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferable between different LMs to retain significant performance, indicating LM prompting may not follow human language patterns.
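    The policy architecture and reward are not given in the abstract; below is a toy REINFORCE sketch of the core loop, with per-position logits standing in for the parameter-efficient policy network, a synthetic reward in place of the frozen LM's task reward, and a moving-average baseline as a crude stand-in for the paper's reward stabilization.

        import torch
        from torch.distributions import Categorical

        vocab_size, prompt_len = 100, 5
        logits = torch.zeros(prompt_len, vocab_size, requires_grad=True)  # toy "policy"
        opt = torch.optim.Adam([logits], lr=0.1)
        baseline = 0.0

        def toy_reward(tokens):  # stand-in for the downstream reward from the frozen LM
            return (tokens == 7).float().sum().item()

        for step in range(200):
            dist = Categorical(logits=logits)
            prompt = dist.sample()                    # discrete prompt tokens
            reward = toy_reward(prompt)
            baseline = 0.9 * baseline + 0.1 * reward  # crude reward stabilization
            loss = -(reward - baseline) * dist.log_prob(prompt).sum()
            opt.zero_grad(); loss.backward(); opt.step()

        print(prompt)  # sampled prompts drift toward the high-reward token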
    Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models. (arXiv:2205.12428v1 [cs.LG])
    Knowledge Distillation (KD) is a prominent neural model compression technique which heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work studies different label regularization techniques and whether we actually need the teacher labels to fine-tune smaller PLM student networks on downstream tasks. In this regard, we conducted a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.
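    For concreteness, a minimal sketch of label smoothing, the teacher-free label regularizer the abstract mentions as a KD replacement; the smoothing strength eps and the toy batch are arbitrary.

        import torch
        import torch.nn.functional as F

        def label_smoothing_loss(student_logits, targets, eps=0.1):
            """Teacher-free regularization: mix one-hot targets with a uniform prior."""
            n = student_logits.size(-1)
            log_probs = F.log_softmax(student_logits, dim=-1)
            soft = (1 - eps) * F.one_hot(targets, n).float() + eps / n
            return -(soft * log_probs).sum(dim=-1).mean()

        logits = torch.randn(8, 3)
        print(label_smoothing_loss(logits, torch.randint(0, 3, (8,))))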
    Multimodal active speaker detection and virtual cinematography for video conferencing. (arXiv:2002.03977v3 [eess.AS] UPDATED)
    Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer, based on subjective ratings on a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned using extensive crowdsourcing techniques and evaluated on a dataset with N=100 meetings, each 2-5 minutes in length.
    A Tree-Structured Multi-Task Model Recommender. (arXiv:2203.05092v2 [cs.LG] UPDATED)
    Tree-structured multi-task architectures have been employed to jointly tackle multiple vision tasks in the context of multi-task learning (MTL). The major challenge is to determine where to branch out for each task given a backbone model to optimize for both task accuracy and computation efficiency. To address the challenge, this paper proposes a recommender that, given a set of tasks and a convolutional neural network-based backbone model, automatically suggests tree-structured multi-task architectures that could achieve a high task performance while meeting a user-specified computation budget without performing model training. Extensive evaluations on popular MTL benchmarks show that the recommended architectures could achieve competitive task accuracy and computation efficiency compared with state-of-the-art MTL methods. Our tree-structured multi-task model recommender is open-sourced and available at https://github.com/zhanglijun95/TreeMTL.
    Exact Phase Transitions in Deep Learning. (arXiv:2205.12510v1 [cs.LG])
    This work reports deep-learning-unique first-order and second-order phase transitions, whose phenomenology closely follows that in statistical physics. In particular, we prove that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer. The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
    Meta-Learning-Based Robust Adaptive Flight Control Under Uncertain Wind Conditions. (arXiv:2103.01932v3 [cs.RO] UPDATED)
    Realtime model learning proves challenging for complex dynamical systems, such as drones flying in variable wind conditions. Machine learning techniques such as deep neural networks have high representational power but are often too slow to update onboard. On the other hand, adaptive control relies on simple linear parameter models that can update as fast as the feedback control loop. We propose an online composite adaptation method that treats outputs from a deep neural network as a set of basis functions capable of representing different wind conditions. To help with training, meta-learning techniques are used to optimize the network output so that it is useful for adaptation. We validate our approach by flying a drone in an open-air wind tunnel under varying wind conditions and along challenging trajectories. We compare the results with other adaptive controllers using different basis function sets and show improvements in tracking and prediction errors.
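    A minimal sketch of the composite-adaptation idea under stated assumptions: a frozen random network stands in for the meta-learned basis, recursive least squares with forgetting stands in for the online update of the linear coefficients, and the dynamics are simulated rather than measured on a drone.

        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(size=(10, 3))        # stand-in for a frozen, meta-learned network
        phi = lambda x: np.tanh(W @ x)      # network outputs used as basis functions

        a = np.zeros(10)                    # linear coefficients, adapted online
        P = np.eye(10) * 100.0              # RLS covariance
        lam = 0.99                          # forgetting factor for changing wind

        true_coef = rng.normal(size=10)
        for t in range(500):
            x = rng.normal(size=3)
            y = true_coef @ phi(x) + 0.01 * rng.normal()  # measured residual force
            f = phi(x)
            k = P @ f / (lam + f @ P @ f)   # RLS gain
            a = a + k * (y - a @ f)         # update as fast as the control loop
            P = (P - np.outer(k, f @ P)) / lam

        print(np.linalg.norm(a - true_coef))  # small: coefficients track the condition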
    Linear Algorithms for Nonparametric Multiclass Probability Estimation. (arXiv:2205.12460v1 [stat.ME])
    Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) have been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wang, Shen and Liu, 2008; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
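    A sketch of the One-vs-All scheme's linear-in-$K$ structure, with logistic regression standing in for the paper's weighted SVMs; the final normalization is the simplest consistent choice, not necessarily the paper's.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=600, n_informative=6, n_classes=3,
                                   random_state=0)
        classes = np.unique(y)

        # One-vs-All: K binary fits (linear in K), then normalize across classes.
        binary = np.column_stack([
            LogisticRegression(max_iter=1000)
            .fit(X, (y == k).astype(int)).predict_proba(X)[:, 1]
            for k in classes
        ])
        probs = binary / binary.sum(axis=1, keepdims=True)
        print(probs[:3].round(3))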
    Memorization in NLP Fine-tuning Methods. (arXiv:2205.12506v1 [cs.CL])
    Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, only the model head, or adapters) compare in terms of memorization risk. This is of increasing concern as the "pre-train and fine-tune" paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks.
    The Web Is Your Oyster -- Knowledge-Intensive NLP against a Very Large Web Corpus. (arXiv:2112.09924v2 [cs.CL] UPDATED)
    In order to address increasing demands of real-world applications, the research on knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge-intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge, either factual or common sense, and ask systems to use a subset of CCNet, the Sphere corpus, as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state-of-the-art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v2 [cs.LG] UPDATED)
    Cellular sheaves equip graphs with a "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain state-of-the-art results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
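    A small numpy sketch of the object underlying this line of work: the sheaf Laplacian of a toy graph with $d$-dimensional stalks and (here random) restriction maps. With identity restriction maps it reduces to a block form of the usual graph Laplacian, which is the "trivial sheaf" the abstract says GNNs implicitly assume.

        import numpy as np

        def sheaf_laplacian(n, maps):
            """maps[(u, v)] = (F_u, F_v): restriction maps of nodes u, v into edge (u, v)."""
            d = next(iter(maps.values()))[0].shape[0]
            L = np.zeros((n * d, n * d))
            for (u, v), (Fu, Fv) in maps.items():
                L[u*d:(u+1)*d, u*d:(u+1)*d] += Fu.T @ Fu
                L[v*d:(v+1)*d, v*d:(v+1)*d] += Fv.T @ Fv
                L[u*d:(u+1)*d, v*d:(v+1)*d] -= Fu.T @ Fv
                L[v*d:(v+1)*d, u*d:(u+1)*d] -= Fv.T @ Fu
            return L

        rng = np.random.default_rng(0)
        maps = {e: (rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
                for e in [(0, 1), (1, 2), (0, 2)]}
        L = sheaf_laplacian(3, maps)
        print(np.allclose(L, L.T), np.linalg.eigvalsh(L).min() > -1e-9)  # symmetric, PSD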
    Training Language Models with Memory Augmentation. (arXiv:2205.12674v1 [cs.CL])
    Recent work has improved language models remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time, or represent them using a separately trained encoder -- resulting in sub-optimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training language models with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories -- local, long-term, and external memory -- at testing time. We evaluate our approach on multiple language modeling and machine translation benchmarks. We find that simply replacing the vanilla language modeling objective by ours greatly reduces the perplexity, without modifying the model architecture or incorporating extra context (e.g., 18.70 $\to$ 17.76 on WikiText-103). We further augment language models with long-range contexts and external knowledge and demonstrate significant gains over previous memory-augmented approaches.
    lpSpikeCon: Enabling Low-Precision Spiking Neural Network Processing for Efficient Unsupervised Continual Learning on Autonomous Agents. (arXiv:2205.12295v1 [cs.NE])
    Recent advances have shown that SNN-based systems can efficiently perform unsupervised continual learning due to their bio-plausible learning rule, e.g., Spike-Timing-Dependent Plasticity (STDP). Such learning capabilities are especially beneficial for use cases like autonomous agents (e.g., robots and UAVs) that need to continuously adapt to dynamically changing scenarios/environments, where new data gathered directly from the environment may have novel features that should be learned online. Current state-of-the-art works employ high-precision weights (i.e., 32 bit) for both training and inference phases, which pose high memory and energy costs thereby hindering efficient embedded implementations of such systems for battery-driven mobile autonomous systems. On the other hand, precision reduction may jeopardize the quality of unsupervised continual learning due to information loss. Towards this, we propose lpSpikeCon, a novel methodology to enable low-precision SNN processing for efficient unsupervised continual learning on resource-constrained autonomous agents/systems. Our lpSpikeCon methodology employs the following key steps: (1) analyzing the impacts of training the SNN model under unsupervised continual learning settings with reduced weight precision on the inference accuracy; (2) leveraging this study to identify SNN parameters that have a significant impact on the inference accuracy; and (3) developing an algorithm for searching the respective SNN parameter values that improve the quality of unsupervised continual learning. The experimental results show that our lpSpikeCon can reduce weight memory of the SNN model by 8x (i.e., by judiciously employing 4-bit weights) for performing online training with unsupervised continual learning and achieve no accuracy loss in the inference phase, as compared to the baseline model with 32-bit weights across different network sizes.
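    The parameter-search algorithm is not detailed in the abstract; the sketch below shows only the 4-bit uniform weight quantization behind the reported 8x weight-memory reduction, under the assumption of symmetric per-tensor scaling.

        import numpy as np

        def quantize_weights(w, bits=4):
            """Uniform symmetric quantization: float32 weights -> 2**bits levels."""
            qmax = 2 ** (bits - 1) - 1
            scale = np.abs(w).max() / qmax
            q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
            return q, scale  # int codes + one scale: ~8x smaller than float32 at 4 bits

        rng = np.random.default_rng(0)
        w = rng.normal(scale=0.1, size=1000)
        q, s = quantize_weights(w)
        print(np.abs(w - q * s).max())  # worst-case rounding error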
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v1 [cs.LG])
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
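    A sketch of the basic measurement behind these results: evaluate the loss along the linear path $\theta(\alpha) = (1-\alpha)\theta_A + \alpha\theta_B$ between two checkpoints. Toy linear models stand in for the paper's finetuned text classifiers; a hump in the middle of the returned list is a barrier.

        import copy
        import torch
        import torch.nn as nn

        def loss_on_path(model_a, model_b, loss_fn, data, steps=11):
            """Loss at theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b."""
            x, y = data
            probe, losses = copy.deepcopy(model_a), []
            for alpha in torch.linspace(0, 1, steps):
                with torch.no_grad():
                    for p, pa, pb in zip(probe.parameters(), model_a.parameters(),
                                         model_b.parameters()):
                        p.copy_((1 - alpha) * pa + alpha * pb)
                    losses.append(loss_fn(probe(x), y).item())
            return losses

        torch.manual_seed(0)
        a, b = nn.Linear(10, 2), nn.Linear(10, 2)
        data = (torch.randn(64, 10), torch.randint(0, 2, (64,)))
        print(loss_on_path(a, b, nn.CrossEntropyLoss(), data))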
    FreDo: Frequency Domain-based Long-Term Time Series Forecasting. (arXiv:2205.12301v1 [cs.LG])
    The ability to forecast far into the future is highly beneficial to many applications, including but not limited to climatology, energy consumption, and logistics. However, due to noise or measurement error, it is questionable how far into the future one can reasonably predict. In this paper, we first mathematically show that due to error accumulation, sophisticated models might not outperform baseline models for long-term forecasting. To demonstrate, we show that a non-parametric baseline model based on periodicity can actually achieve comparable performance to a state-of-the-art Transformer-based model on various datasets. We further propose FreDo, a frequency domain-based neural network model that is built on top of the baseline model to enhance its performance and which greatly outperforms the state-of-the-art model. Finally, we validate that the frequency domain is indeed better by comparing univariate models trained in the frequency vs. the time domain.
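    A sketch of a periodicity baseline of the kind described: find the dominant period with an FFT and forecast by tiling the last observed cycle. FreDo itself is a neural model built on top of such a baseline; this is only the baseline.

        import numpy as np

        def periodic_baseline(series, horizon):
            """Forecast by repeating the last cycle of the dominant FFT period."""
            spectrum = np.abs(np.fft.rfft(series - series.mean()))
            freq = np.argmax(spectrum[1:]) + 1   # skip the DC bin
            period = round(len(series) / freq)
            return np.resize(series[-period:], horizon)

        t = np.arange(400)
        rng = np.random.default_rng(0)
        series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=400)
        print(periodic_baseline(series, horizon=48)[:5])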
    Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models. (arXiv:2205.12746v1 [q-fin.CP])
    This article aims to propose and apply a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of their components, helping to make investment strategy decisions through a trading algorithm. Methodologically, regression and classification models were applied to standard datasets from the Brazilian and American markets, together with algorithmic error metrics. The results were analyzed and compared to those of the Na\"ive forecast and to the returns obtained by the buy & hold technique over the same period. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and on classification by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors; in certain datasets the returns exceeded those of the buy & hold control model by two times and the Sharpe ratio by up to four times.
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v1 [cs.LG])
    We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
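    For reference, the classical inequality these bounds start from: for a $K$-class label $Y$ and final prediction $\hat{Y}$, with $P_e = \Pr(\hat{Y} \neq Y)$ and entropies in bits, Fano's inequality states $H(Y \mid \hat{Y}) \le H_b(P_e) + P_e \log_2(K-1)$, which (using $H_b(P_e) \le 1$) rearranges to the error lower bound $P_e \ge (H(Y) - I(Y;\hat{Y}) - 1) / \log_2(K-1)$. The mutual information $I(Y;\hat{Y})$ measures what the final combined prediction retains about $Y$, which is where the abstract argues the information lost in combining model predictions must be accounted for.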
    CGX: Adaptive System Support for Communication-Efficient Deep Learning. (arXiv:2111.08617v4 [cs.DC] UPDATED)
    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, even though individual server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training and larger-scale multi-node training. CGX is based on two technical advances: \emph{At the system level}, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. \emph{At the application level}, it provides \emph{seamless, parameter-free} integration with popular frameworks, so that end-users do not have to modify training recipes or significant amounts of training code. This is complemented by a \emph{layer-wise adaptive compression} technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
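    A sketch of one standard instance of the compressed-communication idea, layer-wise top-k gradient sparsification with error feedback; CGX's actual communication stack and its adaptive compression policy are more involved than this.

        import numpy as np

        def topk_compress(grad, ratio):
            """Keep the largest-magnitude entries; return (indices, values)."""
            k = max(1, int(grad.size * ratio))
            idx = np.argpartition(np.abs(grad), -k)[-k:]
            return idx, grad[idx]

        rng = np.random.default_rng(0)
        grad = rng.normal(size=10_000)
        residual = np.zeros_like(grad)      # error feedback keeps the dropped mass
        for step in range(3):
            idx, vals = topk_compress(grad + residual, ratio=0.01)  # ~100x less traffic
            sent = np.zeros_like(grad)
            sent[idx] = vals
            residual = (grad + residual) - sent
        print(np.linalg.norm(residual))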
    Machine learning methods for Schlieren imaging of a plasma channel in tenuous atomic vapor. (arXiv:2205.12731v1 [physics.plasm-ph])
    We investigate the usage of a Schlieren imaging setup to measure the geometrical dimensions of a plasma channel in atomic vapor. Near resonant probe light is used to image the plasma channel in a tenuous vapor and machine learning techniques are tested for extracting quantitative information from the images. By building a database of simulated signals with a range of plasma parameters for training Deep Neural Networks, we demonstrate that they can extract from the Schlieren images reliably and with high accuracy the location, the radius and the maximum ionization fraction of the plasma channel as well as the width of the transition region between the core of the plasma channel and the unionized vapor. We test several different neural network architectures with supervised learning and show that the parameter estimations supplied by the networks are resilient with respect to slight changes of the experimental parameters that may occur in the course of a measurement.
    Scalable Online Change Detection for High-dimensional Data Streams. (arXiv:2205.12706v1 [cs.LG])
    Detecting changes in data streams is a core objective in their analysis and has applications in, say, predictive maintenance, fraud detection, and medicine. A principled approach to detect changes is to compare distributions observed within the stream to each other. However, data streams often are high-dimensional, and changes can be complex, e.g., only manifest themselves in higher moments. The streaming setting also imposes heavy memory and computation restrictions. We propose an algorithm, Maximum Mean Discrepancy Adaptive Windowing (MMDAW), which leverages the well-known Maximum Mean Discrepancy (MMD) two-sample test, and facilitates its efficient online computation on windows whose size it flexibly adapts. As MMD is sensitive to any change in the underlying distribution, our algorithm is a general-purpose non-parametric change detector that fulfills the requirements imposed by the streaming setting. Our experiments show that MMDAW achieves better detection quality than state-of-the-art competitors.
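    The quantity MMDAW monitors is the kernel MMD; a minimal sketch of its (biased) estimate with an RBF kernel follows, with the adaptive-windowing logic omitted.

        import numpy as np

        def mmd2(x, y, bandwidth=1.0):
            """Biased estimate of MMD^2 between samples x and y (RBF kernel)."""
            def k(a, b):
                d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * bandwidth ** 2))
            return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

        rng = np.random.default_rng(0)
        null_a = rng.normal(0, 1, size=(200, 5))
        null_b = rng.normal(0, 1, size=(200, 5))
        shifted = rng.normal(0.5, 1, size=(200, 5))   # mean shift in the stream
        print(mmd2(null_a, null_b), mmd2(null_a, shifted))  # near zero vs. clearly larger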
    First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization. (arXiv:2205.12381v1 [cs.LG])
    How can we train an assistive human-machine interface (e.g., an electromyography-based limb prosthesis) to translate a user's raw command signals into the actions of a robot or computer when there is no prior mapping, we cannot ask the user for supervision in the form of action labels or reward feedback, and we do not have prior knowledge of the tasks the user is trying to accomplish? The key idea in this paper is that, regardless of the task, when an interface is more intuitive, the user's commands are less noisy. We formalize this idea as a completely unsupervised objective for optimizing interfaces: the mutual information between the user's command signals and the induced state transitions in the environment. To evaluate whether this mutual information score can distinguish between effective and ineffective interfaces, we conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games. The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains, with an average Spearman's rank correlation of 0.43. In addition to offline evaluation of existing interfaces, we use our unsupervised objective to learn an interface from scratch: we randomly initialize the interface, have the user attempt to perform their desired tasks using the interface, measure the mutual information score, and update the interface to maximize mutual information through reinforcement learning. We evaluate our method through a user study with 12 participants who perform a 2D cursor control task using a perturbed mouse, and an experiment with one user playing the Lunar Lander game using hand gestures. The results show that we can learn an interface from scratch, without any user supervision or prior knowledge of tasks, in under 30 minutes.
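    A sketch of the scoring idea on synthetic discrete data: the mutual information between discretized commands and induced state transitions separates an interface that follows the user from one that mostly ignores them. The binning and noise model here are placeholders, not the study's features.

        import numpy as np
        from sklearn.metrics import mutual_info_score

        rng = np.random.default_rng(0)
        commands = rng.integers(0, 4, size=5000)   # discretized user command signals

        def interface(follow_prob):
            """State transitions that follow the command with some probability."""
            noise = rng.integers(0, 4, size=commands.size)
            return np.where(rng.random(commands.size) < follow_prob, commands, noise)

        intuitive, confusing = interface(0.9), interface(0.3)
        print(mutual_info_score(commands, intuitive),   # higher MI
              mutual_info_score(commands, confusing))   # lower MI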
    An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation. (arXiv:2205.12753v1 [cs.CV])
    The performance of machine learning models under distribution shift has been the focus of the community in recent years. Most current methods have been proposed to improve robustness to distribution shift from the algorithmic perspective, i.e., designing better training algorithms to aid generalization to shifted test distributions. This paper studies the distribution shift problem from the perspective of pre-training and data augmentation, two important factors in the practice of deep learning that have not been systematically investigated by existing work. By evaluating seven pre-trained models, including ResNets and ViTs in self-supervised and supervised modes, on five important distribution-shift datasets from the WILDS and DomainBed benchmarks, with five different learning algorithms, we provide the first comprehensive empirical study focusing on pre-training and data augmentation. With our empirical results obtained from 1,330 models, we provide the following main observations: 1) ERM combined with data augmentation can achieve state-of-the-art performance if we choose a proper pre-trained model respecting the data property; 2) specialized algorithms further improve the robustness on top of ERM when handling a specific type of distribution shift, e.g., GroupDRO for spurious correlation and CORAL for large-scale out-of-distribution data; 3) comparing different pre-training modes, architectures and data sizes, we provide novel observations about pre-training under distribution shift, which sheds light on designing or selecting a pre-training strategy for different kinds of distribution shifts. In summary, our empirical study provides a comprehensive baseline for a wide range of pre-training models fine-tuned with data augmentation, which potentially inspires research exploiting the power of pre-training and data augmentation in future distribution-shift studies.
    An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems. (arXiv:2205.12755v1 [cs.LG])
    Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage model size and data scale rather than scaling the number of tasks. Also, continual learning, which adds a temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component of next-generation artificial intelligence. We propose an evolutionary method that can generate a large-scale multitask model and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as cifar10: 99.43%.
    MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge. (arXiv:2205.12660v1 [cs.LG])
    Deep neural network (DNN) latency characterization is a time-consuming process and adds significant cost to Neural Architecture Search (NAS) processes when searching for efficient convolutional neural networks for embedded vision applications. DNN latency is a hardware-dependent metric and requires direct measurement or inference on target hardware. A recently introduced latency estimation technique known as MAPLE predicts DNN execution time on previously unseen hardware devices by using hardware performance counters. Leveraging these hardware counters in the form of an implicit prior, MAPLE achieves state-of-the-art performance in latency prediction. Here, we propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency to better account for model stability and robustness. First, by identifying DNN architectures that exhibit a similar latency to each other, we can generate multiple virtual examples to significantly improve the accuracy over MAPLE. Secondly, the hardware specifications are used to determine the similarity between training and test hardware to emphasize training samples captured from comparable devices (domains), encouraging improved domain alignment. Experimental results using a convolutional neural network NAS benchmark across different types of devices, including an Intel processor that is now used for embedded vision applications, demonstrate a 5% improvement over MAPLE and 9% over HELP. Furthermore, we include ablation studies to independently assess the benefits of virtual examples and hardware-based sample importance.
    Additive Logistic Mechanism for Privacy-Preserving Self-Supervised Learning. (arXiv:2205.12430v1 [cs.LG])
    We study the privacy risks that are associated with training a neural network's weights with self-supervised learning algorithms. Through empirical evidence, we show that the fine-tuning stage, in which the network weights are updated with an informative and often private dataset, is vulnerable to privacy attacks. To address the vulnerabilities, we design a post-training privacy-protection algorithm that adds noise to the fine-tuned weights and propose a novel differential privacy mechanism that samples noise from the logistic distribution. Compared to the two conventional additive noise mechanisms, namely the Laplace and the Gaussian mechanisms, the proposed mechanism uses a bell-shaped distribution that resembles the distribution of the Gaussian mechanism, and it satisfies pure $\epsilon$-differential privacy similar to the Laplace mechanism. We apply membership inference attacks on both unprotected and protected models to quantify the trade-off between the models' privacy and performance. We show that the proposed protection algorithm can effectively reduce the attack accuracy to roughly 50\% (equivalent to random guessing) while maintaining a performance loss below 5\%.
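    A sketch of the mechanism's additive step; note the noise scale below is an unspecified input, whereas the paper calibrates it to the weights' sensitivity and the privacy budget $\epsilon$.

        import numpy as np

        def logistic_mechanism(weights, scale, seed=0):
            """Post-training protection: add i.i.d. logistic noise to fine-tuned weights.
            `scale` must be calibrated to sensitivity and epsilon (as in the paper)."""
            rng = np.random.default_rng(seed)
            return weights + rng.logistic(loc=0.0, scale=scale, size=weights.shape)

        rng = np.random.default_rng(1)
        w = rng.normal(size=(256, 64))      # stand-in for fine-tuned weights
        protected = logistic_mechanism(w, scale=0.05)
        print(np.abs(protected - w).mean())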
    PLAtE: A Large-scale Dataset for List Page Web Extraction. (arXiv:2205.12386v1 [cs.CL])
    Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items. PLAtE encompasses both the tasks of: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 53,905 items from 6,810 pages, making it the first large-scale list page web extraction dataset. We construct PLAtE by collecting list pages from Common Crawl, then annotating them on Mechanical Turk. Quantitative and qualitative analyses are performed to demonstrate PLAtE has high-quality annotations. We establish strong baseline performance on PLAtE with a SOTA model achieving an F1-score of 0.750 for attribute classification and 0.915 for segmentation, indicating opportunities for future research innovations in web extraction.
    A Universal Error Measure for Input Predictions Applied to Online Graph Problems. (arXiv:2205.12850v1 [cs.DS])
    We introduce a novel measure for quantifying the error in input predictions. The error is based on a minimum-cost hyperedge cover in a suitably defined hypergraph and provides a general template which we apply to online graph problems. The measure captures errors due to absent predicted requests as well as unpredicted actual requests; hence, predicted and actual inputs can be of arbitrary size. We achieve refined performance guarantees for previously studied network design problems in the online-list model, such as Steiner tree and facility location. Further, we initiate the study of learning-augmented algorithms for online routing problems, such as the traveling salesperson problem and dial-a-ride problem, where (transportation) requests arrive over time (online-time model). We provide a general algorithmic framework and we give error-dependent performance bounds that improve upon known worst-case barriers, when given accurate predictions, at the cost of slightly increased worst-case bounds when given predictions of arbitrary quality.
  • Open

    Algorithms for the Communication of Samples. (arXiv:2110.12805v3 [cs.IT] UPDATED)
    The efficient communication of noisy data has applications in several areas of machine learning, such as neural compression or differential privacy, and is also known as reverse channel coding or the channel simulation problem. Here we propose two new coding schemes with practical advantages over existing approaches. First, we introduce ordered random coding (ORC) which uses a simple trick to reduce the coding cost of previous approaches. This scheme further illuminates a connection between schemes based on importance sampling and the so-called Poisson functional representation. Second, we describe a hybrid coding scheme which uses dithered quantization to more efficiently communicate samples from distributions with bounded support.
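    A sketch of subtractively dithered quantization, the building block of the hybrid scheme: encoder and decoder share a uniform dither, and the reconstruction error becomes uniform and independent of the data.

        import numpy as np

        data_rng = np.random.default_rng(0)
        x = data_rng.uniform(-1, 1, size=100_000)   # source with bounded support
        delta = 0.25                                # quantizer step

        dither_rng = np.random.default_rng(42)      # seed shared by encoder and decoder
        u = dither_rng.uniform(-delta / 2, delta / 2, size=x.shape)

        q = np.round((x + u) / delta)   # integer indices: all that is communicated
        x_hat = q * delta - u           # decoder subtracts the shared dither

        err = x_hat - x                 # uniform on [-delta/2, delta/2]
        print(err.min(), err.max())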
    A Neural Tangent Kernel Formula for Ensembles of Soft Trees with Arbitrary Architectures. (arXiv:2205.12904v1 [cs.LG])
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can have various architectures, the theoretical impact of those architectures is not well understood. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Deletion and Insertion Tests in Regression Models. (arXiv:2205.12423v1 [cs.LG])
    A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of \cite{petsiuk2018rise} are used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS). Exact computation of KS grows exponentially with dimension, while that of IG grows linearly with dimension. In two data sets including binary variables we find that KS is superior to IG in insertion and deletion tests, but only by a very small amount. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels. We show that IG will match KS when $f$ is an additive function plus a multilinear function of the variables. This includes a multilinear interpolation over the binary variables that would cause IG to have exponential cost in a naive implementation.
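    A sketch of the deletion test in this setting: remove features from most to least important according to a given ranking and integrate the model output along the way. The baseline value and the toy linear $f$ are placeholders.

        import numpy as np

        def deletion_curve_auc(f, x, importance, baseline=0.0):
            """Delete features from most to least important; AUC of f along the way."""
            order = np.argsort(-importance)
            x_cur, preds = x.copy(), [f(x)]
            for j in order:
                x_cur[j] = baseline
                preds.append(f(x_cur))
            return np.trapz(preds, dx=1.0 / len(order))

        w = np.array([3.0, -2.0, 0.5, 0.1])
        f = lambda z: float(w @ z)
        x = np.ones(4)
        print(deletion_curve_auc(f, x, importance=np.abs(w * x)))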
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v3 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v2 [stat.ML] UPDATED)
    We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Learning Distributions by Generative Adversarial Networks: Approximation and Generalization. (arXiv:2205.12601v1 [cs.LG])
    We study how well generative adversarial networks (GAN) learn probability distributions from finite samples by analyzing the convergence rates of these models. Our analysis is based on a new oracle inequality that decomposes the estimation error of GAN into the discriminator and generator approximation errors, generalization error and optimization error. To estimate the discriminator approximation error, we establish error bounds on approximating H\"older functions by ReLU neural networks, with explicit upper bounds on the Lipschitz constant of the network or norm constraint on the weights. For the generator approximation error, we show that a neural network can approximately transform a low-dimensional source distribution into a high-dimensional target distribution and bound the approximation error by the width and depth of the network. Combining the approximation results with generalization bounds of neural networks from statistical learning theory, we establish the convergence rates of GANs in various settings, when the error is measured by a collection of integral probability metrics defined through H\"older classes, including the Wasserstein distance as a special case. In particular, for distributions concentrated around a low-dimensional set, we show that the convergence rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension.
    Residuals-based distributionally robust optimization with covariate information. (arXiv:2012.01088v2 [math.OC] UPDATED)
    We consider data-driven approaches that integrate a machine learning prediction model within distributionally robust optimization (DRO) given limited joint observations of uncertain parameters and covariates. Our framework is flexible in the sense that it can accommodate a variety of regression setups and DRO ambiguity sets. We investigate asymptotic and finite sample properties of solutions obtained using Wasserstein, sample robust optimization, and phi-divergence-based ambiguity sets within our DRO formulations, and explore cross-validation approaches for sizing these ambiguity sets. Through numerical experiments, we validate our theoretical results, study the effectiveness of our approaches for sizing ambiguity sets, and illustrate the benefits of our DRO formulations in the limited data regime even when the prediction model is misspecified.
    A Kernel Stein Test for Comparing Latent Variable Models. (arXiv:1907.00586v4 [stat.ML] UPDATED)
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.
    Testing for Outliers with Conformal p-values. (arXiv:2104.08279v3 [stat.ME] UPDATED)
    This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are both valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Furthermore, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by numerical experiments on real and simulated data.
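    For context, the marginally valid conformal p-value the paper builds on is simple to state: rank the test point's nonconformity score among the reference scores.

        import numpy as np

        def conformal_pvalue(cal_scores, test_score):
            """Marginal conformal p-value from reference nonconformity scores."""
            return (1 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1)

        rng = np.random.default_rng(0)
        cal = rng.normal(size=1000)        # scores of the reference data set
        print(conformal_pvalue(cal, 0.1))  # inlier-like: unremarkable p-value
        print(conformal_pvalue(cal, 4.0))  # outlier-like: p-value near 1/(n+1)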
    Probabilistic model-error assessment of deep learning proxies: an application to real-time inversion of borehole electromagnetic measurements. (arXiv:2205.12684v1 [physics.geo-ph])
    The advent of fast sensing technologies allows for real-time model updates in many applications where the model parameters are uncertain. Bayesian algorithms, such as ensemble smoothers, offer a real-time probabilistic inversion accounting for uncertainties. However, they rely on the repeated evaluation of the computational models, and deep neural network (DNN) based proxies can be useful to address this computational bottleneck. This paper studies the effects of the approximate nature of the deep learned models and associated model errors during the inversion of extra-deep borehole electromagnetic (EM) measurements, which are critical for geosteering. Using a deep neural network (DNN) as a forward model allows us to perform thousands of model evaluations within seconds, which is very useful for quantifying uncertainties and non-uniqueness in real-time. While significant efforts are usually made to ensure the accuracy of the DNN models, it is known that they contain unknown model errors in the regions not covered by the training data. When DNNs are utilized during inversion of EM measurements, the effects of the model errors could manifest themselves as a bias in the estimated input parameters and, consequently, might result in a low-quality geosteering decision. We present numerical results highlighting the challenges associated with the inversion of EM measurements while neglecting model error. We further demonstrate the utility of a recently proposed flexible iterative ensemble smoother in reducing the effect of model bias by capturing the unknown model errors, thus improving the quality of the estimated subsurface properties for geosteering operation. Moreover, we describe a procedure for identifying inversion multimodality and propose possible solutions to alleviate it in real-time.
    EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures. (arXiv:2205.10390v2 [cs.LG] UPDATED)
    Protein complexes are macromolecules essential to the functioning and well-being of all living organisms. As the structure of a protein complex, in particular its region of interaction between multiple protein subunits (i.e., chains), has a notable influence on the biological function of the complex, computational methods that can quickly and effectively be used to refine and assess the quality of a protein complex's 3D structure can directly be used within a drug discovery pipeline to accelerate the development of new therapeutics and improve the efficacy of future vaccines. In this work, we introduce the Equivariant Graph Refiner (EGR), a novel E(3)-equivariant graph neural network (GNN) for multi-task structure refinement and assessment of protein complexes. Our experiments on new, diverse protein complex datasets, all of which we make publicly available in this work, demonstrate the state-of-the-art effectiveness of EGR for atomistic refinement and assessment of protein complexes and outline directions for future work in the field. In doing so, we establish a baseline for future studies in macromolecular refinement and structure analysis.
    When Is Partially Observable Reinforcement Learning Not Scary?. (arXiv:2204.08967v2 [cs.LG] UPDATED)
    Applications of Reinforcement Learning (RL) in which agents must make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
    Low-rank Optimal Transport: Approximation, Statistics and Debiasing. (arXiv:2205.12365v1 [stat.ML])
    The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in \cite{scetbon2021lowrank} holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low-nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical bounds, relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.
    Fast calculation of Gaussian Process multiple-fold cross-validation residuals and their covariances. (arXiv:2101.03108v2 [stat.ME] UPDATED)
We generalize fast Gaussian process leave-one-out formulae to multiple-fold cross-validation, highlighting in turn in broad settings the covariance structure of cross-validation residuals. The employed approach, which relies on block matrix inversion via Schur complements, is applied to both Simple and Universal Kriging frameworks. We illustrate how resulting covariances affect model diagnostics and how to properly transform residuals in the first place. Beyond that, we examine how accounting for dependency between such residuals affects cross-validation-based estimation of the scale parameter. It is found in two distinct cases, namely in scale estimation and in broader covariance parameter estimation via pseudo-likelihood, that correcting for covariances between cross-validation residuals leads back to maximum likelihood estimation or to an original variation thereof. The proposed fast calculation of Gaussian Process multiple-fold cross-validation residuals is implemented and benchmarked against a naive implementation, all in R language. Numerical experiments highlight the accuracy of our approach as well as the substantial speed-ups that it enables. It is noticeable, however, as supported by a discussion on the main drivers of computational costs and by a dedicated numerical benchmark, that speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. Overall, our results enable fast multiple-fold cross-validation, have direct consequences in GP model diagnostics, and pave the way to future work on hyperparameter fitting as well as on the promising field of goal-oriented fold design.
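    For intuition, a sketch of the underlying identity in the simple (zero-mean) kriging case, where the residuals of every fold can be read off a single inverse of the covariance matrix; the paper's implementation (in R, with residual covariances and Universal Kriging) is not reproduced here:

        import numpy as np

        def multifold_cv_residuals(K, y, folds):
            # e_I = inv(Kinv[I, I]) @ (Kinv @ y)[I] for each fold I,
            # a consequence of block (Schur-complement) inversion.
            Kinv = np.linalg.inv(K)   # one O(n^3) factorization for all folds
            alpha = Kinv @ y
            out = []
            for I in folds:
                I = np.asarray(I)
                out.append(np.linalg.solve(Kinv[np.ix_(I, I)], alpha[I]))
            return out  # the covariance of e_I is inv(Kinv[I, I])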
    Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit. (arXiv:1905.13654v11 [stat.ML] UPDATED)
    Recent work by Jacot et al. (2018) has shown that training a neural network using gradient descent in parameter space is related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model when the network width is large. Indeed, under regularity conditions, the NTK converges to a time-independent kernel in the infinite-width limit. This regime is often called the NTK regime. In parallel, recent works on signal propagation (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) studied the impact of the initialization and the activation function on signal propagation in deep neural networks. In this paper, we connect these two theories by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we provide a comprehensive analysis of the convergence rates of the NTK regime to the infinite depth regime.
    Clustering consistency with Dirichlet process mixtures. (arXiv:2205.12924v1 [math.ST])
    Dirichlet process mixtures are flexible non-parametric models, particularly suited to density estimation and probabilistic clustering. In this work we study the posterior distribution induced by Dirichlet process mixtures as the sample size increases, and more specifically focus on consistency for the unknown number of clusters when the observed data are generated from a finite mixture. Crucially, we consider the situation where a prior is placed on the concentration parameter of the underlying Dirichlet process. Previous findings in the literature suggest that Dirichlet process mixtures are typically not consistent for the number of clusters if the concentration parameter is held fixed and data come from a finite mixture. Here we show that consistency for the number of clusters can be achieved if the concentration parameter is adapted in a fully Bayesian way, as commonly done in practice. Our results are derived for data coming from a class of finite mixtures, with mild assumptions on the prior for the concentration parameter and for a variety of choices of likelihood kernels for the mixture.
    Tell me why! Explanations support learning relational and causal structure. (arXiv:2112.03753v3 [cs.LG] UPDATED)
Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v1 [cs.LG])
In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume a certain amount of resource from each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v1 [stat.ML])
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on public data, then using private data only for "transfer learning." In particular, we minimize the maximum mean discrepancy (MMD) between private target data and the generated distribution, using a kernel based on perceptual features from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images faithfully with $\varepsilon \approx 2$, far surpassing the current state of the art, which only models MNIST and FashionMNIST at $\varepsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.
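    A sketch of the "privatize once" idea for a linear-kernel MMD, where the data-dependent term reduces to the mean feature embedding of the private set; the noise scale sigma is an assumed placeholder for whatever the chosen (epsilon, delta) implies:

        import torch

        def privatized_mean_embedding(private_feats, sigma):
            # private_feats: (n, d) perceptual features of the private data,
            # assumed row-normalized so the mean has bounded sensitivity.
            mu = private_feats.mean(dim=0)
            return mu + sigma * torch.randn_like(mu)  # Gaussian mechanism, applied once

        # Training then minimizes ||mean_embedding(generated batch) - mu_private||^2
        # with ordinary, non-private optimization (no per-step noise as in DP-SGD).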
    Imposing Gaussian Pre-Activations in a Neural Network. (arXiv:2205.12379v1 [cs.LG])
    The goal of the present work is to propose a way to modify both the initialization distribution of the weights of a neural network and its activation function, such that all pre-activations are Gaussian. We propose a family of pairs initialization/activation, where the activation functions span a continuum from bounded functions (such as Heaviside or tanh) to the identity function. This work is motivated by the contradiction between existing works dealing with Gaussian pre-activations: on one side, the works in the line of the Neural Tangent Kernels and the Edge of Chaos are assuming it, while on the other side, theoretical and experimental results challenge this hypothesis. The family of pairs initialization/activation we are proposing will help us to answer this hot question: is it desirable to have Gaussian pre-activations in a neural network?
    Understanding Programmatic Weak Supervision via Source-aware Influence Function. (arXiv:2205.12879v1 [cs.LG])
Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. To achieve this, we build on Influence Function (IF) and propose source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as source vote, supervision source, and training data. On datasets of diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles that reveals insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v1 [stat.ML])
We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that the existence of an exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via transport-entropy inequality (T-E). These results have significant impact in reinforcement learning (RL) and controls, leading to exponential concentration inequalities even for unbounded observables, while neither assuming reversibility nor exact knowledge of the random dynamical system (assumptions at the heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    Learning from time-dependent streaming data with online stochastic algorithms. (arXiv:2205.12549v1 [cs.LG])
We study stochastic algorithms in a streaming framework, trained on samples coming from a dependent data source. In this streaming framework, we analyze the convergence of Stochastic Gradient (SG) methods in a non-asymptotic manner; this includes various SG methods such as the well-known stochastic gradient descent (i.e., the Robbins-Monro algorithm) and mini-batch SG methods, together with their averaged estimates (i.e., Polyak-Ruppert averaging). Our results form a heuristic by linking the level of dependency and convexity to the rest of the model parameters. This heuristic provides new insights into choosing the optimal learning rate, which can help increase the stability of SG-based methods; these investigations suggest large streaming batches with slowly decaying learning rates for highly dependent data sources.
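    A sketch of the setting, assuming a user-supplied gradient oracle and illustrative step-size constants; per the heuristic above, larger streaming batches with slower-decaying rates suit highly dependent data:

        import numpy as np

        def streaming_sgd(grad_fn, theta0, stream, lr0=0.1, decay=0.5):
            # grad_fn(theta, batch) returns a stochastic gradient;
            # `stream` yields (possibly dependent) mini-batches.
            theta = np.asarray(theta0, dtype=float)
            theta_bar = theta.copy()
            for t, batch in enumerate(stream, start=1):
                theta = theta - (lr0 / t**decay) * grad_fn(theta, batch)
                theta_bar += (theta - theta_bar) / t  # Polyak-Ruppert average
            return theta, theta_bar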
    Conformal Prediction Intervals with Temporal Dependence. (arXiv:2205.12940v1 [stat.ML])
    Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time-series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure which is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time-series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
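    For reference, the standard split-conformal interval that serves as the cross-sectional baseline; CPTD itself additionally adjusts the nonconformity scores using temporal information, which this sketch does not attempt (model and calibration-set sizes are assumptions):

        import numpy as np

        def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
            # Absolute-residual scores on a held-out calibration set;
            # assumes len(y_cal) is large enough that the quantile level <= 1.
            scores = np.abs(y_cal - model.predict(X_cal))
            n = len(scores)
            q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
            pred = model.predict(x_new.reshape(1, -1))[0]
            return pred - q, pred + q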
    Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models. (arXiv:2205.12746v1 [q-fin.CP])
This article proposes and applies a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of their components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from the Brazilian and American markets, in addition to algorithmic error metrics. The results were analyzed and compared to those of the Naïve forecast and to the returns obtained by the buy-and-hold technique over the same period of time. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and on the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors, where on certain datasets the returns exceeded those of the buy-and-hold control model by a factor of two and the Sharpe ratio by up to a factor of four.
    Mitigating multiple descents: A model-agnostic framework for risk monotonization. (arXiv:2205.12937v1 [math.ST])
Recent empirical and theoretical analyses of several commonly used prediction procedures reveal a peculiar risk behavior in high dimensions, referred to as double/multiple descent, in which the asymptotic risk is a non-monotonic function of the limiting aspect ratio of the number of features or parameters to the sample size. To mitigate this undesirable behavior, we develop a general framework for risk monotonization based on cross-validation that takes as input a generic prediction procedure and returns a modified procedure whose out-of-sample prediction risk is, asymptotically, monotonic in the limiting aspect ratio. As part of our framework, we propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting, respectively, and show that, under very mild assumptions, they provably achieve monotonic asymptotic risk behavior. Our results are applicable to a broad variety of prediction procedures and loss functions, and do not require a well-specified (parametric) model. We exemplify our framework with concrete analyses of the minimum $\ell_2$- and $\ell_1$-norm least squares prediction procedures. As one of the ingredients in our analysis, we also derive novel additive and multiplicative forms of oracle risk inequalities for split cross-validation that are of independent interest.
    Linear Algorithms for Nonparametric Multiclass Probability Estimation. (arXiv:2205.12460v1 [stat.ME])
Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) have been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wang, Shen and Liu, 2008; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
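    A schematic of why One-vs-All is linear in $K$: one binary problem per class instead of the $\binom{K}{2}$ problems of pairwise coupling. The sketch uses a calibrated LinearSVC from scikit-learn as a stand-in for the paper's weighted SVM estimator, which it does not reproduce:

        from sklearn.svm import LinearSVC
        from sklearn.calibration import CalibratedClassifierCV

        def ova_probabilities(X_train, y_train, X_test):
            # LinearSVC handles multiclass one-vs-rest by default: K binary fits.
            clf = CalibratedClassifierCV(LinearSVC(), cv=5)
            clf.fit(X_train, y_train)
            return clf.predict_proba(X_test)  # (n_test, K) class probabilities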
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v3 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework targets the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where exact inference of latent variables is typically intractable. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space often lacks physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for enhancing and generalizing the predictive capabilities of deep learning-based models.
    Multi-Agent Low-Dimensional Linear Bandits. (arXiv:2007.01442v4 [cs.LG] UPDATED)
    We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. By distributing the search for the optimal subspace across users and learning of the unknown vector by each agent in the corresponding low-dimensional subspace, we show that the per-agent finite-time regret is much smaller than the case when agents do not communicate. We finally complement these results through simulations.
    Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel. (arXiv:2106.09387v3 [math.ST] UPDATED)
Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.
    Learning the Travelling Salesperson Problem Requires Rethinking Generalization. (arXiv:2006.07054v6 [cs.LG] UPDATED)
End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v2 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity. (arXiv:2104.13790v3 [cs.LG] UPDATED)
AdaBelief, one of the current best optimizers, demonstrates superior generalization ability compared to the popular Adam algorithm by viewing the exponential moving average of observed gradients. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains, however, an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that aims to exploit strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size so that it better accounts for strong convexity and prevents fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependent $O(\log T)$ regret bound, which is substantially lower than that of AdaBelief. On the empirical side, we validate our theoretical analysis with extensive experiments in both scenarios of strong and non-strong convexity on three popular baseline models. Experimental results are very encouraging: FastAdaBelief converges the quickest in comparison to all mainstream algorithms while maintaining an excellent generalization ability, in cases of both strong and non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v1 [cs.LG])
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v1 [cs.LG])
    Learning causal structure poses a combinatorial search problem that typically involves evaluating structures using a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize the process of causal structure learning. Rather than searching over causal structures directly, we train a variational inference model to predict the causal structure from observational/interventional data. Our inference model acquires domain-specific inductive bias for causal discovery solely from data generated by a simulator. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Moreover, the architecture of our inference model is permutation invariant w.r.t. the data points and permutation equivariant w.r.t. the variables, facilitating generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities under substantial distribution shift and significantly outperform existing algorithms, especially in the challenging genomics domain.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v2 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other.
    A scalable multi-step least squares method for network identification with unknown disturbance topology. (arXiv:2106.07548v3 [eess.SY] UPDATED)
Identification methods for dynamic networks typically require prior knowledge of the network and disturbance topology, and often rely on solving poorly scalable non-convex optimization problems. While methods for estimating network topology are available in the literature, less attention has been paid to estimating the disturbance topology, i.e., the (spatial) noise correlation structure and the noise rank in a filtered white noise representation of the disturbance signal. In this work we present an identification method for dynamic networks, in which an estimation of the disturbance topology precedes the identification of the full dynamic network with known network topology. To this end we extend the multi-step Sequential Linear Regression and Weighted Null Space Fitting methods to deal with reduced-rank noise, and use these methods to estimate the disturbance topology and the network dynamics in the full measurement situation. As a result, we provide a multi-step least squares algorithm with parallel computation capabilities that relies only on explicit analytical solutions, thereby avoiding the usual non-convex optimizations involved. Consequently, we consistently estimate dynamic networks of Box-Jenkins model structure, while keeping the computational burden low. We provide a consistency proof that includes path-based data informativity conditions for the allocation of excitation signals in the experimental design. Numerical simulations performed on a dynamic network with reduced-rank noise clearly illustrate the potential of this method.
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v1 [stat.ML])
State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is one of the most effective approaches to defend against such examples. We show that for linear regression problems, adversarial training can be formulated as a convex problem. This fact is then used to show that $\ell_\infty$-adversarial training produces sparse solutions and has many similarities to the lasso method. Similarly, $\ell_2$-adversarial training has similarities with ridge regression. We use a robust regression framework to analyze and understand these similarities and also point to some differences. Finally, we show how adversarial training behaves differently from other regularization methods when estimating overparameterized models (i.e., models with more parameters than datapoints). It minimizes a sum of three terms which regularizes the solution, but unlike lasso and ridge regression, it can sharply transition into an interpolation mode. We show that for sufficiently many features or sufficiently small regularization parameters, the learned model perfectly interpolates the training data while still exhibiting good out-of-sample performance.
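    The $\ell_\infty$ case admits a simple closed form that makes the lasso connection concrete: the worst-case squared error over $\|\delta\|_\infty \le \epsilon$ equals $(|y - x^\top w| + \epsilon \|w\|_1)^2$. A sketch of the resulting convex objective:

        import numpy as np

        def linf_adversarial_loss(w, X, y, eps):
            # max over ||delta||_inf <= eps of (y - (x + delta)^T w)^2
            # = (|y - x^T w| + eps * ||w||_1)^2, so the objective is convex in w.
            margin = np.abs(y - X @ w) + eps * np.linalg.norm(w, 1)
            return np.mean(margin ** 2)

        # minimize with a subgradient method or any off-the-shelf convex solver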
    On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity. (arXiv:2205.12642v1 [stat.ML])
Most complex machine learning and modelling techniques are prone to over-fitting and may subsequently generalise poorly to future data. Artificial neural networks are no different in this regard and, despite having a level of implicit regularisation when trained with gradient descent, often require the aid of explicit regularisers. We introduce a new framework, Model Gradient Similarity (MGS), that (1) serves as a metric of regularisation, which can be used to monitor neural network training, (2) adds insight into how explicit regularisers, while derived from widely different principles, operate via the same mechanism underneath by increasing MGS, and (3) provides the basis for a new regularisation scheme which exhibits excellent performance, especially in challenging settings such as high levels of label noise or limited sample sizes.
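    As an illustrative proxy for the idea (the paper's exact MGS definition is not reproduced here), one can monitor the cosine similarity of per-example loss gradients during training:

        import torch

        def gradient_similarity(model, loss_fn, xa, ya, xb, yb):
            # Cosine similarity between the loss gradients of two examples.
            def flat_grad(x, y):
                model.zero_grad()
                loss_fn(model(x), y).backward()
                return torch.cat([p.grad.reshape(-1) for p in model.parameters()])
            ga, gb = flat_grad(xa, ya), flat_grad(xb, yb)
            return torch.nn.functional.cosine_similarity(ga, gb, dim=0)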
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v1 [cs.LG])
We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
    Gradient-based explanations for Gaussian Process regression and classification models. (arXiv:2205.12797v1 [cs.LG])
Gaussian Processes (GPs) have proven themselves as a reliable and effective method in probabilistic Machine Learning. Thanks to recent and current advances, modeling complex data with GPs is becoming more and more feasible. Thus, these types of models are, nowadays, an interesting alternative to Neural and Deep Learning methods, which are arguably the current state-of-the-art in Machine Learning. For the latter, we see an increasing interest in so-called explainable approaches - in essence methods that aim to make a Machine Learning model's decision process transparent to humans. Such methods are particularly needed when illogical or biased reasoning can lead to actual disadvantageous consequences for humans. Ideally, explainable Machine Learning should help detect such flaws in a model and aid a subsequent debugging process. One active line of research in Machine Learning explainability are gradient-based methods, which have been successfully applied to complex neural networks. Given that GPs are closed under differentiation, gradient-based explainability for GPs appears as a promising field of research. This paper is primarily focused on explaining GP classifiers via gradients where, contrary to GP regression, derivative GPs are not straightforward to obtain.
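    In the regression case the gradient is available in closed form; a sketch for an RBF kernel, with $\alpha = (K + \sigma^2 I)^{-1} y$ assumed precomputed (the classification case the paper targets needs more care because of the non-Gaussian likelihood):

        import numpy as np

        def gp_mean_gradient(x, X_train, alpha, lengthscale):
            # m(x) = sum_i alpha_i k(x, x_i); for RBF,
            # dk/dx = -k(x, x_i) * (x - x_i) / lengthscale**2.
            diff = x - X_train  # (n, d)
            k = np.exp(-0.5 * np.sum(diff**2, axis=1) / lengthscale**2)
            return -(alpha * k) @ diff / lengthscale**2  # (d,) gradient at x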
    Deep interpretable ensembles. (arXiv:2205.12729v1 [stat.ML])
Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.
    Multimodal active speaker detection and virtual cinematography for video conferencing. (arXiv:2002.03977v3 [eess.AS] UPDATED)
Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting and zooming of a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC that performs within 0.3 MOS of an expert cinematographer based on subjective ratings with a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and reduce switching latency the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques and evaluated on a dataset with N=100 meetings, each 2-5 minutes in length.

  • Open

How to start reinforcement learning as an electrical and electronic control engineer?
submitted by /u/Ibrahim_Attawil
Sneak Peek of Animo Island, a PC game that empowers players to explore reinforcement learning as a game mechanic - Currently looking for play testers for an upcoming beta release!
submitted by /u/AnimoIsland
    What universities are hubs for reinforcement learning research?
I am currently working through Reinforcement Learning by Sutton and Barto. It is a great book, but since RL is such a sub-speciality of machine learning, I am wondering which universities have research being done in the field, not just in machine learning in general. Where are the top papers coming from? submitted by /u/caedin8
    [D] confusion about KL divergence in PPO (implementation confusion)
Hi all, I am following this workflow for PPO using Keras and I have come across something that doesn't make sense to me; any help and clarification is appreciated - I am following PPO Keras. In this line of code when calculating the ratio:

    def train_policy(observation_buffer, action_buffer, logprobability_buffer, advantage_buffer):
        with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
            ratio = tf.exp(
                logprobabilities(actor(observation_buffer), action_buffer)
                - logprobability_buffer
            )

I am confused about the subtraction of the logprobability buffer. The line that precedes this is the collection of information from the buffer:

    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

I feel like there is some tautology here, although I am probably missing something very obvious (with some misunderstanding of KL divergence/importance sampling, I'm sure). The logprobabilities function takes the logits from the actor model, carries out a softmax over those logits, and returns the log of the action that was selected. But in the train step, we take the observations and actions from the buffer (the same ones for the epoch that has just been run), call logprobabilities to get a batch of log-probs for each state-action pair, and subtract the log-probabilities from the buffer. However, the log-probabilities from the buffer ARE the same log-probabilities that the actor model will spit out, as they are the same observations and actions that led to the log-probs being inserted into the buffer - and the weights have not been updated by this point, so the softmax will be identical. What is the point of subtracting the same log-probabilities from the log-probability buffer? I don't see here the traditional 'old' vs 'new' policy. Thank you! submitted by /u/amjass12
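    One likely resolution, sketched from how the Keras example's outer loop is usually structured (the names train_policy_iterations and target_kl are assumed from that example): the buffer's log-probabilities are frozen at collection time, but train_policy is called several times per epoch, so the ratio is 1 only on the very first gradient step:

        for _ in range(train_policy_iterations):  # several updates on the same buffer
            kl = train_policy(obs_buf, act_buf, logp_buf, adv_buf)
            # step 1: actor weights == weights that filled the buffer -> ratio = 1
            # step 2+: weights have changed while logp_buf has not -> ratio != 1,
            #          and PPO's clipping of the ratio starts to bite
            if kl > 1.5 * target_kl:
                break  # early stopping on KL, as in the example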
Do you guys have any resources/tutorials on how to implement your own RL environments/agents?
I'm new to RL and I'm trying to use it for a classification task, yet I'm not quite sure how I should initialize the environment and agents. Any recommendations? submitted by /u/Affectionate_Worth43
    What are the types of tasks we can perform given an environment in MA Reinforcement Learning?
Hello, I am new to reinforcement learning and I need to find a game environment and make agents cooperate in an optimal way. I found some environments (PettingZoo, ML-Agents SoccerTwos...). I read about RL algorithms such as MADDPG and my intuition was to integrate such algorithms into some of the environments I found. I want to know if this task is interesting, and I welcome any suggestions for tasks we can execute on MARL environments. submitted by /u/Ok_Lab_2750
    "HyperTree Proof Search for Neural Theorem Proving", Lemple et al 2022 {FB} (56% -> 65% MetaMath proofs)
submitted by /u/gwern
  • Open

    [P] Scale ML experiments from JupyterLab to the cloud
First Medium article is out! Come see how Optumi is thinking about the shifting workflow needs of data science and machine learning professionals. https://medium.com/@optumi/scale-ml-experiments-from-jupyterlab-to-the-cloud-141bd645d8e9 submitted by /u/chrismarrie
    [D] Google Imagen authors now produce images based on your prompt!
If you are interested in getting your text converted to an image by Google Brain Imagen use the following link: https://twitter.com/mo_norouzi/status/1529497457234780162?s=20&t=3K_M972bMeGRR2wG6kobHQ submitted by /u/aifordummies
    [D] From classification to regression and some physics analogies
Hello, Here is: https://www.researchgate.net/publication/360541388_From_Classification_to_Regression_A_Note_on_Deodata_Predictors The paper describes how to adapt a set of classification algorithms in order to perform nonlinear regression. The algorithms are described with simple numerical examples. In the "Field Sampling Density" section, the described operation is akin to estimating the strength of a field. I am interested in your opinions. Thanks. submitted by /u/crispub
    [N] Pull Requests and Discussions on Hugging Face
Hey, it's Merve from Hugging Face 👋 I wanted to share some big news that I hope you find useful. The 🤗 Hub now has pull requests (PR) and discussions in repositories to improve collaboration in machine learning 🥳✨ What does PR really mean here? Let's assume you have a big PyTorch model and someone else ported it to TensorFlow; that person can contribute that port to your model repository. Someone else can open a PR to improve your model, fix your machine learning demo in a Space or change anything in the dataset. This applies to model/Space/dataset (any repo) repositories on the Hub. You might say this sounds similar to GitHub. For code, GitHub works super well and we don't want to (and it would be very inefficient to) recreate the feature set of GitHub. What we want to focus on is creating the collaboration toolset that's optimized for ML. You can learn more about these new features here: https://huggingface.co/blog/community-update. Looking forward to your feedback and suggestions! ✨ Hope this is useful 🙂 submitted by /u/unofficialmerve
    [D] PyTorch processes taking up tons of GPU memory - any way to reduce this?
I am running on Arch Linux 5.17.9-arch1-1 with an NVIDIA GeForce RTX 3090 GPU. I need to run multiple processes for a reinforcement learning task, where each subprocess runs the data collection (and inference) and all the samples from that are then retrieved via queues in the main process and optimized (e.g. think of PPO but distributed, like IMPALA). I am using torch.multiprocessing for this. Unfortunately, the multiple spawned subprocesses cause A LOT of overhead in terms of GPU memory being used. See below for my nvidia-smi output:

    | 0 N/A N/A 559025 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559026 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559027 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559028 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559029 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559030 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559031 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559032 C ...3/envs/ml/bin/python 1873MiB |
    | 0 N/A N/A 559033 C ...3/envs/ml/bin/python 1873MiB |

So it seems like each subprocess loads the entirety of PyTorch into GPU memory, which seems incredibly inefficient. Is there a way to get the subprocesses to only load this once and then share it? How can I reduce the GPU footprint for each process? EDIT: Even using a basic example from the PyTorch repo I can see the same problem: because it's not forking, it seems to be using up tons of GPU memory for each process. Can this not be fixed? submitted by /u/tmuxed
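    The per-process cost here is mostly the CUDA context and kernel cache that every process touching the GPU must allocate (roughly 0.5-2 GB each), and it cannot be shared across processes. One common workaround, sketched under the assumption that actor inference can tolerate the CPU: keep the policy in shared CPU memory for the workers and let only the learner process own the GPU:

        import torch
        import torch.multiprocessing as mp

        def actor(shared_policy, queue):
            # CPU-only inference: this process never creates a CUDA context.
            with torch.no_grad():
                obs = torch.randn(1, 4)  # stand-in for an env observation
                queue.put((obs, shared_policy(obs)))

        if __name__ == "__main__":
            policy = torch.nn.Linear(4, 2)  # stand-in policy network
            policy.share_memory()           # zero-copy sharing with the workers
            q = mp.Queue()
            workers = [mp.Process(target=actor, args=(policy, q)) for _ in range(4)]
            for w in workers:
                w.start()
            # learner: move batches to the GPU here, in this one process only
            for w in workers:
                w.join()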
    [P] ZenML: Build vendor-agnostic, production-ready MLOps pipelines
Hello r/MachineLearning! Some here might remember we open-sourced ZenML, a year or so ago, and started building it out in the open. Today, we're re-launching it to the world, with a brand-new look, and a sharper focus. We've spoken to hundreds of ML teams in the last year, and here is what we've found: 🐘 Getting ML into production reliably is still hard today. It takes too long, is too complicated, and not enough people know how to do it. 🦡 MLOps platforms are not the answer because they are opinionated, rigid, and slow to change. It's time for MLOps frameworks to shine, and bring structure to the ML ecosystem that is ripe for standardization. 🐼 Well-thought-out abstractions that make sense and are flexible are what the industry needs. Our launch blog post, "The Framework Way is t…
    [P] Looking for advice how to apply clustering to learned embeddings of user-item interactions
Hi, I'm working on a consumer segmentation job, where the goal is to understand if there are subgroups of consumers who behave in a similar way to each other, but different from the rest of the population. My dataset contains a user interacting with an item for a period of time, with a reasonable assumption that more time spent means a more favorable view of the item (so we can take the time spent as a proxy for the user's taste). So far, I've created an ML model to learn embeddings of size 64 to represent my approximately 950k users and their interactions with the ~10k items. My original plan was to apply k-means clustering to those 64-dimensional user embeddings. However, this approach isn't yielding the degree of separation I require (e.g. the top most popular items to interact with are all the same ones for each cluster). Trying different values for k, I also get a basically entirely flat elbow graph. How should I proceed from here? I've thought of 2 options: (1) retraining my embeddings with a smaller size and retrying with k-means, or (2) researching an alternative clustering algorithm. Is there anything I haven't considered yet, but should? If not, which of these 2 approaches would you explore first, and if you prefer the latter, which algorithm(s) would you test out first? Thanks for your help! submitted by /u/the_Wallie
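    One cheap thing to try before changing algorithms: k-means on raw interaction embeddings is often dominated by vector norms (an activity/popularity effect), so L2-normalizing first, which makes k-means approximate spherical k-means on cosine similarity, can separate taste rather than volume. A sketch with scikit-learn (the embedding array is a stand-in):

        import numpy as np
        from sklearn.preprocessing import normalize
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        user_embeddings = np.random.randn(950_000, 64)  # stand-in for the learned embeddings
        emb = normalize(user_embeddings)                # project onto the unit sphere
        for k in (5, 10, 20):
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(emb)
            sample = np.random.choice(len(emb), 20_000, replace=False)
            print(k, silhouette_score(emb[sample], labels[sample]))

    If the elbow stays flat even then, a density-based method such as HDBSCAN (which does not force every user into a cluster) is a reasonable next experiment.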
[D] Different input image size when using Vision Transformers
I have an image classification problem, and have been using ResNet. The dense layers at the end are replaced with 1x1 convolutions, making the model fully convolutional. Classification is done on 128 x 128 patches, so if the input image is 128 x 128, I'll get output size 1x1. If the image is 512 x 512, the output will be 4x4. Each output element will hold the prediction of which class the patch at that position belongs to. Now I'd like to try using a transformer instead of ResNet. Can a similar thing be done with Vision Transformers? Are there any examples of that being done? submitted by /u/alkibijad
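    Yes: the usual trick (used, e.g., in timm) is to resize the learned position embeddings to the new patch grid and keep everything else; a sketch assuming a ViT with a leading [CLS] token:

        import torch
        import torch.nn.functional as F

        def resize_pos_embed(pos_embed, old_grid, new_grid):
            # pos_embed: (1, 1 + old_grid**2, dim); bicubic resize of the patch grid.
            cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
            dim = patch_pos.shape[-1]
            patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
            patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                                      mode="bicubic", align_corners=False)
            patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid**2, dim)
            return torch.cat([cls_tok, patch_pos], dim=1)

    Per-patch predictions analogous to the fully convolutional setup can then be read off the patch tokens.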
    [P] Second-tier Recommender System in FunCorp
Matrix decomposition is not perfect for improving recommendation systems. For example, you will find it hard to add the gender and age of a user. In this article, we describe how we implemented a second, ranking level of the model above the collaborative one, and how two-stage recommendation systems help us apply more complex algorithms https://medium.com/@FunCorp/putting-a-two-layered-recommendation-system-into-production-b8caaf61393d submitted by /u/Puzzleheaded_Egg_396
    [D] Does TensorFlow Lite use the DropIT method to handle intermediate tensors?
This blog post (Optimizing TensorFlow Lite Runtime Memory) says that TensorFlow Lite employs different approaches to handle intermediate tensors which occupy large amounts of memory. Is one of them the DropIT (Dropping Intermediate Tensors for Memory-Efficient DNN Training) method? submitted by /u/teraRockstar
    [D] ISTM results are out - First International Symposium on the Tsetlin Machine
ISTM Technical Program: you can find the full technical program here: https://istm.no/program/ submitted by /u/olegranmo
[P] Image Background Changer: you can change the background to whatever you want.
This project was made using the rembg package, which performs image segmentation with U^2-Net. https://reddit.com/link/uxathe/video/chhv2eewbk191/player submitted by /u/supercornson
[Discussion] Best affordable way for doing ML online (Colab Pro, etc.)
I have exhausted my free GCP credits and I was wondering what the best (affordable) ways to do machine learning online are. Colab Pro (10 USD per month), Kaggle, Paperspace Gradient, etc. come to mind. Any thoughts on which is the best? PS: I have used the Colab free version and Kaggle before - the session timeouts that force you to re-run the notebook from the beginning are the worst experience ever. Also, the only information I could find about Gradient was from the founder, which obviously is not very reliable. Anybody else used it? submitted by /u/BigNet1356
[P] Serverless model training platform for any cloud. Alternative to SageMaker
Hi all, We've been working on building a serverless model training platform that can run on any cloud provider. We have a beta version of the platform currently working on AWS and Azure. Looking forward to your feedback, and to hearing whether this is something enterprise users would use. We are at https://cloud.netbook.ai/ submitted by /u/scb_11
    [P] fastdup: tool for curating computer vision datasets at scale
https://github.com/visualdatabase/fastdup submitted by /u/gradientflow
  • Open

    Deep Learning with Label Differential Privacy
Posted by Pasin Manurangsi and Chiyuan Zhang, Research Scientists, Google Research Over the last several years, there has been an increased focus on developing differential privacy (DP) machine learning (ML) algorithms. DP has been the basis of several practical deployments in industry — and has even been employed by the U.S. Census — because it enables the understanding of system and algorithm privacy guarantees. The underlying assumption of DP is that changing a single user’s contribution to an algorithm should not significantly change its output distribution. In the standard supervised learning setting, a model is trained to make a prediction of the label for each input given a training set of example pairs {[input1,label1], …, [inputn, labeln]}. In the case of deep learning, previ…
  • Open

GA / Evolution newbie question
Say I have a hundred NNs, of which 90 perform badly and ten perform "ok" at playing a game. Now how do I seed new NNs from these ten? What "makes" a particular NN is its weights. So how do I generate new NNs from these ten with random mutations? And how do I even know that there is something like an "optimal" NN that can actually play the game? With a math background, I think there must be something like a "solution space", and it's not clear to me that a NN obtained through random mutations of the weights is even in a solution space. Thanks! submitted by /u/CantFixMoronic
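    A minimal sketch of the standard recipe, assuming each network's weights are flattened into a NumPy vector (the mutation scale sigma is a knob to tune):

        import numpy as np

        def make_child(parent_a, parent_b, sigma=0.05, rng=None):
            # Uniform crossover: each weight comes from one parent at random;
            # mutation: small Gaussian noise on top.
            rng = np.random.default_rng() if rng is None else rng
            mask = rng.random(parent_a.shape) < 0.5
            child = np.where(mask, parent_a, parent_b)
            return child + sigma * rng.normal(size=child.shape)

        # repopulate: pair up the ten survivors at random until you have 100 again

    As for the "solution space" worry: nothing guarantees an individual mutated network is near an optimum; it is the repeated selection step that keeps the population drifting toward better-performing regions of weight space.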
    Which type of learning is used in an evolutionary simulation?
I'm creating an evolutionary simulator. I have creatures that I want to teach how to gather food. My creatures have a neural network brain, they have various input values for things such as location, proximity to food etc. They have 2 output neurons - moveX and moveY. Their brains are represented using a genetic code. Upon procreation with another creature, the genes from both parents are randomly inherited and mutations can occur. After a few thousand cycles, the creatures successfully learn how to gather food in the most efficient way possible. What type of learning would this fall into? I wouldn't say it is supervised learning since nobody is labeling the data, nor am I calculating some fitness value. Their survival is determined purely by the simulation conditions. I would guess that this falls under reinforcement learning, but I'm not sure, it might also be unsupervised learning. Can you help me find the proper terminology? submitted by /u/Log_Dogg
    How to Optimize your HuggingFace Transformers
    submitted by /u/aidev2040 [link] [comments]
    A grounded deep symbolic neural network for perception
    Here is a paper titled "A Grounded Deep Symbolic Neural Network for Perception" that I am planning to submit to the NeSy 2022 workshop. It is one of a number of workshops in the Second International Conference on Learning and Reasoning (IJCLR 2022) in Windsor, UK on 28th – 30th September. Papers are due by May 31st. I would appreciate any constructive feedback you would like to make. https://www.adaptroninc.com/sites/default/files/inline-files/Grounded_Deep_Symbolic_Neural_Network_for_comments.pdf Abstract Both animals and artificial intelligent agents rely upon the identification of types of objects and events during perception. It is a categorization process, which senses, recognizes and encodes invariant features. Binary neurons (binons) are general-purpose artificial neural nodes for representing properties, objects, events and relationships between them. Non-symbolic binons are used in short-term memory to represent core sensory properties such as position, intensity and time and ones derived from them. Ratios derived from these properties are converted into invariant symbolic categories using a novel discretization algorithm based on the Weber-Fechner psychophysical laws. Symbolic binons are combined to form deep hierarchical neural networks that comprise long-term memory. It contains spatial and temporal binons representing the shape and contrast patterns for categories of objects and events. They are grounded on the core and derived properties. Empirical evidence of their successful use in classifying handwritten digits was provided by Martensen in 2013[1]. The neural networks are 100% symbolic, transparent, compositional, scalable and sparse. Learning is continuous and unsupervised. submitted by /u/BrettNMartensen [link] [comments]  ( 1 min )
  • Open

    Is diversity the key to collaboration? New AI research suggests so
    A new training approach yields artificial intelligence that adapts to diverse play-styles in a cooperative game, in what could be a win for human-AI teaming.  ( 6 min )
    Early sound exposure in the womb shapes the auditory system
    Modeling study suggests that the muffled environment in utero primes the brain’s ability to interpret some types of sound.  ( 7 min )
  • Open

    Hugging Face now allows Pull Requests and Discussions on repositories
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Microsoft announces 3 months of unlimited Codex access and free tokens for OpenAI's API
    submitted by /u/Wireless_Life [link] [comments]  ( 1 min )
Artificial Intelligence Implications: The Future of Digital Marketing | Hey everyone, this is my final post in a university blog series on AI, the future, and a personal area of interest! Would love some feedback and advice for future blog-style posts!
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
    GitHub - A complete guide to start and improve in machine learning (ML), artificial intelligence (AI) in 2022 without ANY background in the field and stay up-to-date with the latest news and state-of-the-art techniques!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    I got GPT-3 to write a novel
    submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    DeepMind Researchers Develop A Machine Learning Technique For Accurate Sampling And Free-Energy Estimate Of Solid Materials Using Normalizing Flows
A significant challenge of computational statistical mechanics is the accurate estimation of equilibrium parameters of a thermodynamic system. For decades, the methods of choice for sampling such systems at large have been molecular dynamics (MD) and hybrid Monte Carlo. Strategies for sampling probability distributions have multiplied, and many of them leverage normalizing flows. Normalizing flows are a technique for creating complicated distributions by transforming a probability density through a sequence of invertible mappings. They are desirable because of two characteristics: first, they can create independent samples rapidly and in parallel, and second, they can report the exact probability density of their generation procedure. Training a flow-based model to approximate a target distribution yields an efficient but approximate sampler, and re-weighting the samples by their probability density can then be used to remove estimation bias for free-energy estimation. The exciting part about flows is that they allow us to obtain accurate estimates even without samples from thermodynamic states. Continue Reading Paper: https://iopscience.iop.org/article/10.1088/2632-2153/ac6b16/pdf Github: https://github.com/deepmind/flows_for_atomic_solids submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
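The re-weighting step mentioned above can be sketched as self-normalized importance sampling. In this sketch, `flow_logprob`, `target_logprob`, and `observable` are hypothetical callables standing in for the flow's exact log-density, the unnormalized target (Boltzmann) log-density, and the quantity being averaged:

```python
import numpy as np

def reweighted_expectation(samples, flow_logprob, target_logprob, observable):
    # Samples come from the flow, which also reports their exact
    # log-density; importance weights correct for the mismatch between
    # the flow and the target distribution.
    log_w = target_logprob(samples) - flow_logprob(samples)
    log_w -= log_w.max()          # subtract the max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalize the weights
    return np.sum(w * observable(samples))
```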
[Project] PaddleSpeech: An Easy-to-use Speech Toolkit including SOTA/Streaming ASR with punctuation, influential TTS with text frontend and the VPR System.
Hi all, glad to share an open-source repository, PaddleSpeech, which provides SOTA/streaming ASR with punctuation, influential TTS with a text frontend, and a production-ready VPR system. Code: https://github.com/PaddlePaddle/PaddleSpeech Feature set: 📦 Ease of use: low barriers to install; CLIs are available to quick-start your project. 🔬 Aligned to the state of the art: provides high-speed and ultra-lightweight models as well as cutting-edge technology. 🏆 Streaming ASR and TTS: provides production-ready streaming ASR and streaming TTS systems. 💯 Rule-based frontend: contains Text Normalization and Grapheme-to-Phoneme conversion (G2P, including Polyphone and Tone Sandhi). 🛎️ Multi-language: both English and Chinese are supported. Examples: Speech recognition - input wav: Input.wav; output text: "I knocked at the door on the ancient side of the building." Text-to-speech - input text: "Life was like a box of chocolates, you never know what you're gonna get."; output wav: Output.wav submitted by /u/Aha_IamDaniel [link] [comments]  ( 1 min )
Watch how AI makes realistic photos from anime
    submitted by /u/Due-Ad9795 [link] [comments]
    AI to track social media behavior of mass shooters
    I know next to nothing about AI, but was wondering if you could use AI to track the behavior or "search" for a pattern of behavior similar to mass shooters. Maybe a more broad question would be, could you develop AI to study human behavior and predict mental state through social media posts. submitted by /u/Lvl-Up-Candy [link] [comments]  ( 1 min )
    Last Week in AI: Autonomous cargo ships, how AI is used in Hollywood, AI to search for guns in public, and more!
    submitted by /u/regalalgorithm [link] [comments]
  • Open

    10 Best Practices For Data Science
    For quite some time now, data science has enjoyed a reputation as the next big revolution in the tech and business landscape. The number of businesses employing the applications of data science has only increased in the recent few years. According to Statista, as of 2021, nearly 60 percent of companies are housing at least… Read More »10 Best Practices For Data Science The post 10 Best Practices For Data Science appeared first on Data Science Central.  ( 7 min )
    Python Book Goodies and Apache Arrow
In my rundown this week, I cover two distinct topics – a new Python analytics book and the rise of Apache Arrow. A New Python Data Analytics Book Published One of the best-known books on data analysis now has a new edition (3rd edition) available in early open access. The creator of the Pandas library,… Read More »Python Book Goodies and Apache Arrow The post Python Book Goodies and Apache Arrow appeared first on Data Science Central.  ( 3 min )
    How Valid-Page Metadata Helps Businesses Grow
    The Internet has added a new document known as valid-page metadata that explores how the Internet processes inconsistent and invalid HTML and deals with issues of invalid mark-up. The Internet updates the help document as the title link with a new section on these headlines and troubleshoots the area.  However, most businesses are not aware… Read More »How Valid-Page Metadata Helps Businesses Grow The post How Valid-Page Metadata Helps Businesses Grow appeared first on Data Science Central.  ( 4 min )
    Why You Need an Augmented Data Integration Tool
    Introduction We are in a time when information is the core element of business success for companies in almost any industry. As technologies emerge and find large-scale adoption, there is an influx of massive amounts of data within enterprises. Two primary challenges need to be solved to obtain the necessary information. First is trustable information… Read More »Why You Need an Augmented Data Integration Tool The post Why You Need an Augmented Data Integration Tool appeared first on Data Science Central.  ( 4 min )
    Pros And Cons of AI In Manufacturing
    The fourth industrial revolution has been a game-changer, with the global economy’s expansion driving the adoption of new technologies across sectors. Manufacturers are using AI software in product design, production, supply chain, and logistics. AI analytics and data are helping in improving product quality and efficiency. Advances in machine learning, artificial intelligence (AI), and Big… Read More »Pros And Cons of AI In Manufacturing The post Pros And Cons of AI In Manufacturing appeared first on Data Science Central.  ( 5 min )
  • Open

    Harmonic e
Douglas Hofstadter discovered that the 8th harmonic number equals e. OK, not really. The following equation cannot possibly be true because the left side is rational and the right side is irrational. However, Hofstadter showed that the equation does hold if you carry all calculations out to three decimal places. 1.000 + 0.500 + 0.333 + 0.250 + 0.200 […] Harmonic e first appeared on John D. Cook.  ( 1 min )
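A quick check of the arithmetic (computed here, on the assumption that the truncated list continues through 1/8):

```python
# Round each term of H_8 = 1 + 1/2 + ... + 1/8 to three decimals, then sum.
terms = [round(1 / k, 3) for k in range(1, 9)]
print(terms)                  # [1.0, 0.5, 0.333, 0.25, 0.2, 0.167, 0.143, 0.125]
print(round(sum(terms), 3))   # 2.718 -- matching e = 2.71828... to three decimals
```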
  • Open

    Optimality Conditions and Algorithms for Top-K Arm Identification. (arXiv:2205.12086v1 [stat.ML])
We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling efforts. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence and fixed-budget settings, as well as the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality. We provide the first provably optimal algorithm in the fixed-confidence setting for k>1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared to existing methods in all three settings.
    Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width. (arXiv:2205.12101v1 [cs.LG])
Substantial work indicates that the dynamics of neural networks (NNs) is closely related to their initialization of parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we take a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradient flow for three-layer ReLU NNs and obtain two key independent quantities to distinguish different dynamical regimes for common initialization methods. With carefully designed experiments and a large computation cost, for both synthetic and real datasets, we find that the dynamics of each layer can also be divided into a linear regime and a condensed regime, separated by a critical regime. The criterion is the relative change of input weights (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) as the width approaches infinity during training, which tends to $0$, $+\infty$ and $O(1)$, respectively. In addition, we demonstrate that different layers can lie in different dynamical regimes in a training process within a deep NN. In the condensed regime, we also observe the condensation of weights in isolated orientations with low complexity. Through experiments in the three-layer condition, our phase diagram suggests a complicated dynamical picture consisting of three possible regimes, together with their mixtures, for deep NNs, and provides guidance for studying deep NNs in different initialization regimes, which reveals the possibility of completely different dynamics emerging within a deep NN for its different layers.
    Graph Convolutional Reinforcement Learning for Collaborative Queuing Agents. (arXiv:2205.12009v1 [cs.NI])
    In this paper, we explore the use of multi-agent deep learning as well as learning to cooperate principles to meet stringent service level agreements, in terms of throughput and end-to-end delay, for a set of classified network flows. We consider agents built on top of a weighted fair queuing algorithm that continuously set weights for three flow groups: gold, silver, and bronze. We rely on a novel graph-convolution based, multi-agent reinforcement learning approach known as DGN. As benchmarks, we propose centralized and distributed deep Q-network approaches and evaluate their performances in different network, traffic, and routing scenarios, highlighting the effectiveness of our proposals and the importance of agent cooperation. We show that our DGN-based approach meets stringent throughput and delay requirements across all scenarios.
    DNNAbacus: Toward Accurate Computational Cost Prediction for Deep Neural Networks. (arXiv:2205.12095v1 [cs.LG])
Deep learning is attracting interest across a variety of domains, including natural language processing, speech recognition, and computer vision. However, model training is time-consuming and requires huge computational resources. Existing works on the performance prediction of deep neural networks, which mostly focus on the training time prediction of a few models, rely on analytical models and result in high relative errors. This paper investigates the computational resource demands of 29 classical deep neural networks and builds accurate models for predicting computational costs. We first analyze the profiling results of typical networks and demonstrate that the computational resource demands of models with different inputs and hyperparameters are not obvious and intuitive. We then propose a lightweight prediction approach DNNAbacus with a novel network structural matrix for network representation. DNNAbacus can accurately predict both memory and time cost for PyTorch and TensorFlow models, which is also generalized to different hardware architectures and can have zero-shot capability for unseen networks. Our experimental results show that the mean relative error (MRE) is 0.9% with respect to time and 2.8% with respect to memory for 29 classic models, which is much lower than the state-of-the-art works.
Training Efficient CNNs: Tweaking the Nuts and Bolts of Neural Networks for Lighter, Faster and Robust Models. (arXiv:2205.12050v1 [cs.LG])
    Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. Many techniques have evolved over the past decade that made models lighter, faster, and robust with better generalization. However, many deep learning practitioners persist with pre-trained models and architectures trained mostly on standard datasets such as Imagenet, MS-COCO, IMDB-Wiki Dataset, and Kinetics-700 and are either hesitant or unaware of redesigning the architecture from scratch that will lead to better performance. This scenario leads to inefficient models that are not suitable on various devices such as mobile, edge, and fog. In addition, these conventional training methods are of concern as they consume a lot of computing power. In this paper, we revisit various SOTA techniques that deal with architecture efficiency (Global Average Pooling, depth-wise convolutions & squeeze and excitation, Blurpool), learning rate (Cyclical Learning Rate), data augmentation (Mixup, Cutout), label manipulation (label smoothing), weight space manipulation (stochastic weight averaging), and optimizer (sharpness aware minimization). We demonstrate how an efficient deep convolution network can be built in a phased manner by sequentially reducing the number of training parameters and using the techniques mentioned above. We achieved a SOTA accuracy of 99.2% on MNIST data with just 1500 parameters and an accuracy of 86.01% with just over 140K parameters on the CIFAR-10 dataset.
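As one concrete instance of the label-manipulation technique listed above, here is a minimal label-smoothing sketch (illustrative, not the paper's code):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Mix the one-hot target with the uniform distribution:
    # y <- (1 - eps) * y + eps / K. This discourages over-confident
    # predictions and often improves generalization.
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

y = np.eye(10)[[3]]       # one-hot target for class 3 of 10
print(smooth_labels(y))   # 0.91 at the true class, 0.01 elsewhere
```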
    On statistic alignment for domain adaptation in structural health monitoring. (arXiv:2205.12052v1 [cs.LG])
    The practical application of structural health monitoring (SHM) is often limited by the availability of labelled data. Transfer learning - specifically in the form of domain adaptation (DA) - gives rise to the possibility of leveraging information from a population of physical or numerical structures, by inferring a mapping that aligns the feature spaces. Typical DA methods rely on nonparametric distance metrics, which require sufficient data to perform density estimation. In addition, these methods can be prone to performance degradation under class imbalance. To address these issues, statistic alignment (SA) is discussed, with a demonstration of how these methods can be made robust to class imbalance, including a special case of class imbalance called a partial DA scenario. SA is demonstrated to facilitate damage localisation with no target labels in a numerical case study, outperforming other state-of-the-art DA methods. It is then shown to be capable of aligning the feature spaces of a real heterogeneous population, the Z24 and KW51 bridges, with only 220 samples used from the KW51 bridge. Finally, in scenarios where more complex mappings are required for knowledge transfer, SA is shown to be a vital pre-processing tool, increasing the performance of established DA methods.
    Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation. (arXiv:2205.11140v2 [cs.LG] UPDATED)
    We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
    Logarithmic regret bounds for continuous-time average-reward Markov decision processes. (arXiv:2205.11168v2 [cs.LG] UPDATED)
    We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
    Deep Neural Network approaches for Analysing Videos of Music Performances. (arXiv:2205.11232v2 [cs.CV] UPDATED)
    This paper presents a framework to automate the labelling process for gestures in musical performance videos with a 3D Convolutional Neural Network (CNN). While this idea was proposed in a previous study, this paper introduces several novelties: (i) Presents a novel method to overcome the class imbalance challenge and make learning possible for co-existent gestures by batch balancing approach and spatial-temporal representations of gestures. (ii) Performs a detailed study on 7 and 18 categories of gestures generated during the performance (guitar play) of musical pieces that have been video-recorded. (iii) Investigates the possibility to use audio features. (iv) Extends the analysis to multiple videos. The novel methods significantly improve the performance of gesture identification by 12 %, when compared to the previous work (51 % in this study over 39 % in previous work). We successfully validate the proposed methods on 7 super classes (72 %), an ensemble of the 18 gestures/classes, and additional videos (75 %).
    GraphMAE: Self-Supervised Masked Graph Autoencoders. (arXiv:2205.10803v2 [cs.LG] UPDATED)
Self-supervised learning (SSL) has been extensively explored in recent years. Particularly, generative SSL has seen emerging success in natural language processing and other fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning--which heavily relies on structural data augmentation and complicated training strategies--has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential as promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present a masked graph autoencoder GraphMAE that mitigates these issues for generative self-supervised graph learning. Instead of reconstructing structures, we propose to focus on feature reconstruction with both a masking strategy and scaled cosine error that benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results manifest that GraphMAE--a simple graph autoencoder with our careful designs--consistently outperforms both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised learning on graphs.
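The scaled cosine error can be sketched as below; the specific form (1 - cosine similarity)^gamma is my reading of the paper and should be treated as an assumption:

```python
import numpy as np

def scaled_cosine_error(x, x_hat, gamma=2.0, eps=1e-8):
    # Per-row cosine similarity between original and reconstructed
    # features; raising (1 - cos) to gamma > 1 down-weights samples
    # that are already reconstructed well.
    num = np.sum(x * x_hat, axis=-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(x_hat, axis=-1) + eps
    return np.mean((1.0 - num / den) ** gamma)
```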
    Bayesian Active Meta-Learning for Black-Box Optimization. (arXiv:2110.09943v2 [cs.LG] CROSS LISTED)
    Data-efficient learning algorithms are essential in many practical applications for which data collection is expensive, e.g., for the optimal deployment of wireless systems in unknown propagation scenarios. Meta-learning can address this problem by leveraging data from a set of related learning tasks, e.g., from similar deployment settings. In practice, one may have available only unlabeled data sets from the related tasks, requiring a costly labeling procedure to be carried out before use in meta-learning. For instance, one may know the possible positions of base stations in a given area, but not the performance indicators achievable with each deployment. To decrease the number of labeling steps required for meta-learning, this paper introduces an information-theoretic active task selection mechanism, and evaluates an instantiation of the approach for Bayesian optimization of black-box models.
    Stack operation of tensor networks. (arXiv:2203.16338v2 [cs.LG] UPDATED)
The tensor network, as a factorization of tensors, aims at performing the operations that are common for normal tensors, such as addition, contraction and stacking. However, due to its non-unique network structure, only tensor network contraction is so far well defined. In this paper, we propose a mathematically rigorous definition for the tensor network stack approach, which compresses a large number of tensor networks into a single one without changing their structures and configurations. We illustrate the main ideas with matrix product state based machine learning as an example. Our results are compared with the for-loop and the efficient-coding implementations on both CPU and GPU.
    Towards Practical Physics-Informed ML Design and Evaluation for Power Grid. (arXiv:2205.03673v2 [cs.LG] UPDATED)
    When applied to a real-world safety critical system like the power grid, general machine learning methods suffer from expensive training, non-physical solutions, and limited interpretability. To address these challenges for power grids, many recent works have explored the inclusion of grid physics (i.e., domain expertise) into their method design, primarily through including system constraints and technical limits, reducing search space and defining meaningful features in latent space. Yet, there is no general methodology to evaluate the practicality of these approaches in power grid tasks, and limitations exist regarding scalability, generalization, interpretability, etc. This work formalizes a new concept of physical interpretability which assesses how a ML model makes predictions in a physically meaningful way and introduces an evaluation methodology that identifies a set of attributes that a practical method should satisfy. Inspired by the evaluation attributes, the paper further develops a novel contingency analysis warm starter for MadIoT cyberattack, based on a conditional Gaussian random field. This method serves as an instance of an ML model that can incorporate diverse domain knowledge and improve on these identified attributes. Experiments validate that the warm starter significantly boosts the efficiency of contingency analysis for MadIoT attack even with shallow NN architectures.
    On Understanding and Mitigating the Dimensional Collapse of Graph Contrastive Learning: a Non-Maximum Removal Approach. (arXiv:2203.12821v2 [cs.LG] UPDATED)
Graph Contrastive Learning (GCL) has shown promising performance in graph representation learning (GRL) without the supervision of manual annotations. GCL can generate graph-level embeddings by maximizing the Mutual Information (MI) between different augmented views of the same graph (positive pairs). However, GCL is limited by dimensional collapse, i.e., embedding vectors only occupy a low-dimensional subspace. In this paper, we show that the smoothing effect of graph pooling and the implicit regularization of graph convolution are two causes of the dimensional collapse in GCL. To mitigate the above issue, we propose a non-maximum removal graph contrastive learning approach (nmrGCL), which removes "prominent" dimensions (i.e., those that contribute most in similarity measurement) for positive pairs in the pre-text task. Comprehensive experiments on various benchmark datasets are conducted to demonstrate the effectiveness of nmrGCL, and the results show that our model outperforms the state-of-the-art methods. Source code will be made publicly available.
    Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers. (arXiv:2204.09481v2 [cs.CL] UPDATED)
    Zero-shot text classifiers based on label descriptions embed an input text and a set of labels into the same space: measures such as cosine similarity can then be used to select the most similar label description to the input text as the predicted label. In a true zero-shot setup, designing good label descriptions is challenging because no development set is available. Inspired by the literature on Learning with Disagreements, we look at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion. We evaluate our method on a set of diverse datasets and tasks (sentiment, topic and stance). Furthermore, we show that multiple, noisy label descriptions can be aggregated to boost the performance.
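The scoring step described above fits in a few lines; `embed` in the usage comment is a hypothetical sentence encoder, not an API from the paper:

```python
import numpy as np

def zero_shot_predict(text_vec, label_desc_vecs):
    # Return the index of the label description whose embedding has the
    # highest cosine similarity to the input text embedding.
    t = text_vec / np.linalg.norm(text_vec)
    l = label_desc_vecs / np.linalg.norm(label_desc_vecs, axis=1, keepdims=True)
    return int(np.argmax(l @ t))

# Usage (embed() is hypothetical):
#   pred = zero_shot_predict(embed("great movie!"),
#                            embed(["positive sentiment", "negative sentiment"]))
```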
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v2 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Rethinking Attention-Model Explainability through Faithfulness Violation Test. (arXiv:2201.12114v2 [cs.LG] UPDATED)
Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed as feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: weakness in identifying the polarity of feature impact. This would be somehow misleading -- features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attention$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth faithfulness violation test) to measure the consistency between explanation weights and the impact polarity. Through the extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially the raw attention. Empirical analyses on the factors affecting violation issues further provide useful observations for adopting explanation methods in attention models.
    Retrieval-Augmented Reinforcement Learning. (arXiv:2202.08417v4 [cs.LG] UPDATED)
Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that, at test time, can condition their behavior on the entire dataset and not only the current state or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.
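Stripped of training details, the retrieval step can be sketched as nearest-neighbor lookup over embedded past experiences (a simplification; in the paper the retrieval process is itself parameterized and trained):

```python
import numpy as np

def retrieve(query_state_emb, experience_embs, k=5):
    # Return indices of the k past experiences whose embeddings are
    # closest (by cosine similarity) to the current state embedding;
    # their contents are then provided to the agent as extra context.
    q = query_state_emb / np.linalg.norm(query_state_emb)
    e = experience_embs / np.linalg.norm(experience_embs, axis=1, keepdims=True)
    return np.argsort(-(e @ q))[:k]
```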
    Logical Fallacy Detection. (arXiv:2202.13758v2 [cs.CL] UPDATED)
    Reasoning is central to human intelligence. However, fallacious arguments are common, and some exacerbate problems such as spreading misinformation about climate change. In this paper, we propose the task of logical fallacy detection, and provide a new dataset (Logic) of logical fallacies generally found in text, together with an additional challenge set for detecting logical fallacies in climate change claims (LogicClimate). Detecting logical fallacies is a hard problem as the model must understand the underlying logical structure of the argument. We find that existing pretrained large language models perform poorly on this task. In contrast, we show that a simple structure-aware classifier outperforms the best language model by 5.46% on Logic and 4.51% on LogicClimate. We encourage future work to explore this task as (a) it can serve as a new reasoning challenge for language models, and (b) it can have potential applications in tackling the spread of misinformation. Our dataset and code are available at https://github.com/causalNLP/logical-fallacy.
    MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data. (arXiv:2203.12369v4 [cs.SD] UPDATED)
    Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results. Incorporating psychoacoustically motivated speech perception metrics as part of model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system) which introduces an additional network - a "de-generator" which attempts to improve the robustness of the prediction network (and by extension of the generator) by ensuring observation of a wider range of metric scores in training. Experimental results on the VoiceBank-DEMAND dataset show relative improvement in PESQ score of 3.8% (3.05 vs 3.22 PESQ score), as well as better generalisation to unseen noise and speech.
    Convolutional Neural Networks on Graphs with Chebyshev Approximation, Revisited. (arXiv:2202.03580v3 [cs.LG] UPDATED)
    Designing spectral convolutional networks is a challenging problem in graph learning. ChebNet, one of the early attempts, approximates the spectral graph convolutions using Chebyshev polynomials. GCN simplifies ChebNet by utilizing only the first two Chebyshev polynomials while still outperforming it on real-world datasets. GPR-GNN and BernNet demonstrate that the Monomial and Bernstein bases also outperform the Chebyshev basis in terms of learning the spectral graph convolutions. Such conclusions are counter-intuitive in the field of approximation theory, where it is established that the Chebyshev polynomial achieves the optimum convergent rate for approximating a function. In this paper, we revisit the problem of approximating the spectral graph convolutions with Chebyshev polynomials. We show that ChebNet's inferior performance is primarily due to illegal coefficients learnt by ChebNet approximating analytic filter functions, which leads to over-fitting. We then propose ChebNetII, a new GNN model based on Chebyshev interpolation, which enhances the original Chebyshev polynomial approximation while reducing the Runge phenomenon. We conducted an extensive experimental study to demonstrate that ChebNetII can learn arbitrary graph convolutions and achieve superior performance in both full- and semi-supervised node classification tasks. Most notably, we scale ChebNetII to a billion graph ogbn-papers100M, showing that spectral-based GNNs have superior performance.
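For reference, the Chebyshev filtering that ChebNet computes uses the standard three-term recurrence; a minimal sketch (not the authors' code):

```python
import numpy as np

def chebyshev_filter(L_tilde, x, theta):
    # Compute sum_k theta[k] * T_k(L_tilde) @ x via the recurrence
    # T_0 = I, T_1 = L_tilde, T_k = 2 * L_tilde * T_{k-1} - T_{k-2}.
    # L_tilde is the rescaled Laplacian with spectrum in [-1, 1];
    # theta must contain at least two coefficients.
    t_prev, t_curr = x, L_tilde @ x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out
```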
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v2 [cs.LG] UPDATED)
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
    Training Differentially Private Models with Secure Multiparty Computation. (arXiv:2202.02625v2 [cs.CR] UPDATED)
    We address the problem of learning a machine learning model from training data that originates at multiple data owners while providing formal privacy guarantees regarding the protection of each owner's data. Existing solutions based on Differential Privacy (DP) achieve this at the cost of a drop in accuracy. Solutions based on Secure Multiparty Computation (MPC) do not incur such accuracy loss but leak information when the trained model is made publicly available. We propose an MPC solution for training DP models. Our solution relies on an MPC protocol for model training, and an MPC protocol for perturbing the trained model coefficients with Laplace noise in a privacy-preserving manner. The resulting MPC+DP approach achieves higher accuracy than a pure DP approach while providing the same formal privacy guarantees. Our work obtained first place in the iDASH2021 Track III competition on confidential computing for secure genome analysis.
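Outside MPC, the perturbation step is the classic Laplace mechanism; a plaintext sketch for intuition (the paper performs this on secret-shared values inside MPC, which this sketch does not capture):

```python
import numpy as np

def laplace_perturb(coefs, sensitivity, epsilon, rng=None):
    # Add Laplace(sensitivity / epsilon) noise to each trained model
    # coefficient before releasing the model.
    rng = rng or np.random.default_rng()
    return coefs + rng.laplace(scale=sensitivity / epsilon, size=coefs.shape)
```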
    PrivFair: a Library for Privacy-Preserving Fairness Auditing. (arXiv:2202.04058v3 [cs.LG] UPDATED)
    Machine learning (ML) has become prominent in applications that directly affect people's quality of life, including in healthcare, justice, and finance. ML models have been found to exhibit discrimination based on sensitive attributes such as gender, race, or disability. Assessing if an ML model is free of bias remains challenging to date, and by definition has to be done with sensitive user characteristics that are subject of anti-discrimination and data protection law. Existing libraries for fairness auditing of ML models offer no mechanism to protect the privacy of the audit data. We present PrivFair, a library for privacy-preserving fairness audits of ML models. Through the use of Secure Multiparty Computation (MPC), PrivFair protects the confidentiality of the model under audit and the sensitive data used for the audit, hence it supports scenarios in which a proprietary classifier owned by a company is audited using sensitive audit data from an external investigator. We demonstrate the use of PrivFair for group fairness auditing with tabular data or image data, without requiring the investigator to disclose their data to anyone in an unencrypted manner, or the model owner to reveal their model parameters to anyone in plaintext.
    On the identifiability of mixtures of ranking models. (arXiv:2201.13132v2 [cs.LG] UPDATED)
    Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (BTL, multinomial logistic models with slates of size 3, or Plackett-Luce) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models to establish generic identifiability. The framework can be applied more broadly to other learning models and may be of independent interest.
    Identifying Dementia Subtypes with Electronic Health Records. (arXiv:2202.00009v2 [cs.LG] UPDATED)
Dementia is characterized by a decline in memory and thinking that is significant enough to impair function in activities of daily living. Patients seen in dementia specialty clinics are highly heterogeneous, with a variety of different symptoms that progress at different rates. In this work, we used an unsupervised, data-driven K-Means clustering approach on the component scores of the Clinical Dementia Rating (CDR) score to identify dementia subtypes, and used the gap statistic to identify the optimal number of clusters. Our goal was to characterize the identified dementia subtypes in terms of their cognitive performance and analyze how patient transitions between subtypes relate to disease progression. Our results indicate both (i) inter-subtype variability, i.e., the variability amongst dementia subtypes for a particular component score even at the same CDR, and (ii) intra-subtype variability, i.e., the variation in the 6 component scores within a particular dementia subtype. We observed that dementia subtypes that represented individuals with very mild dementia (CDR 0.5) had widely varying rates of transition to other subtypes. Future work includes testing the generalizability of our proposed pipeline on additional datasets, and using a larger volume of EHR data to obtain probabilistic estimates of the variability between dementia subtypes, both in terms of cognitive profile and disease progression.
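A minimal sketch of the K-Means plus gap-statistic recipe described above, using scikit-learn (illustrative; not the authors' pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, rng=None):
    # Gap(k) = mean(log W_k on uniform reference data) - log W_k on X,
    # where W_k is the K-Means within-cluster dispersion (inertia_).
    # Larger gaps indicate stronger clustering structure at this k.
    rng = rng or np.random.default_rng()
    log_wk = np.log(KMeans(n_clusters=k, n_init=10).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wks = [
        np.log(KMeans(n_clusters=k, n_init=10)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    return np.mean(ref_log_wks) - log_wk
```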
    Efficient Strong Scaling Through Burst Parallel Training. (arXiv:2112.10065v3 [cs.DC] UPDATED)
    As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2 - 2.3x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
    Combining optimal path search with task-dependent learning in a neural network. (arXiv:2201.11104v3 [cs.LG] UPDATED)
Finding optimal paths in connected graphs requires determining the smallest total cost for traveling along the graph's edges. This problem can be solved by several classical algorithms where, usually, costs are predefined for all edges. Conventional planning methods can, thus, normally not be used when wanting to change costs in an adaptive way following the requirements of some task. Here we show that one can define a neural network representation of path finding problems by transforming cost values into synaptic weights, which allows for online weight adaptation using network learning mechanisms. When starting with an initial activity value of one, activity propagation in this network will lead to solutions, which are identical to those found by the Bellman-Ford algorithm. The neural network has the same algorithmic complexity as Bellman-Ford and, in addition, we can show that network learning mechanisms (such as Hebbian learning) can adapt the weights in the network, augmenting the resulting paths according to some task at hand. We demonstrate this by learning to navigate in an environment with obstacles as well as by learning to follow certain sequences of path nodes. Hence, the novel algorithm presented here may open up a different regime of applications where path augmentation (by learning) is directly coupled with path finding in a natural way.
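For reference, the classical Bellman-Ford relaxation that the network's activity propagation reproduces (this is the textbook algorithm, not the paper's neural formulation):

```python
def bellman_ford(n, edges, source):
    # edges: list of (u, v, cost). Returns the minimum total cost from
    # `source` to every node (inf if unreachable). Relaxing all edges
    # n - 1 times guarantees convergence when there are no negative cycles.
    dist = [float("inf")] * n
    dist[source] = 0.0
    for _ in range(n - 1):
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
    return dist

print(bellman_ford(4, [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)], 0))
# [0.0, 1.0, 3.0, 4.0]
```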
    Balanced Graph Structure Learning for Multivariate Time Series Forecasting. (arXiv:2201.09686v2 [cs.LG] UPDATED)
Accurate forecasting of multivariate time series is an extensively studied subject in finance, transportation, and computer science. Fully mining the correlation and causation between the variables in a multivariate time series exhibits noticeable results in improving the performance of a time series model. Recently, some models have explored the dependencies between variables through end-to-end graph structure learning without the need for predefined graphs. However, current models do not incorporate the trade-off between efficiency and flexibility and lack the guidance of domain knowledge in the design of graph structure learning algorithms. This paper alleviates the above issues by proposing Balanced Graph Structure Learning for Forecasting (BGSLF), a novel deep learning model that joins graph structure learning and forecasting. Technically, BGSLF leverages the spatial information into convolutional operations and extracts temporal dynamics using the diffusion convolutional recurrent network. The proposed framework balances the trade-off between efficiency and flexibility by introducing Multi-Graph Generation Network (MGN) and Graph Selection Module. In addition, a method named Smooth Sparse Unit (SSU) is designed to sparsify the learned graph structures, which conforms to the sparse spatial correlations in the real world. Extensive experiments on four real-world datasets demonstrate that our model achieves state-of-the-art performances with minor trainable parameters. Code will be made publicly available.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v2 [cs.LG] UPDATED)
    This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
    PowerGraph: Using neural networks and principal components to determine multivariate statistical power trade-offs. (arXiv:2201.00719v2 [stat.ME] UPDATED)
    Statistical power estimation for studies with multiple model parameters is inherently a multivariate problem. Power for individual parameters of interest cannot be reliably estimated univariately since correlation and variance explained relative to one parameter will impact the power for another parameter, all usual univariate considerations being equal. Explicit solutions in such cases, especially for models with many parameters, are either impractical or impossible to solve, leaving researchers to the prevailing method of simulating power. However, the point estimates for a vector of model parameters are uncertain, and the impact of inaccuracy is unknown. In such cases, sensitivity analysis is recommended such that multiple combinations of possible observable parameter vectors are simulated to understand power trade-offs. A limitation to this approach is that it is computationally expensive to generate sufficient sensitivity combinations to accurately map the power trade-off function in increasingly high-dimensional spaces for the models that social scientists estimate. This paper explores the efficient estimation and graphing of statistical power for a study over varying model parameter combinations. We propose a simple and generalizable machine learning inspired solution to cut the computational cost to less than 10% of the brute force method while providing F1 scores above 90%. We further motivate the impact of transfer learning in learning power manifolds across varying distributions.
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v2 [stat.ML] UPDATED)
We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions, we use Hamiltonian Monte Carlo sampling. We demonstrate our approach in a simulation study with hemodynamical models, which are time-dependent differential equations. Data are simulated from a more complex model than our modelling choice, and the aim is to learn physical parameters according to known mathematical connections. To demonstrate the flexibility of our approach, an example using the Heat equation, a space-time dependent differential equation where we consider a case of a biased data-acquisition process, is also included. Finally, we fit the hemodynamic model using real data obtained in a medical trial.
    Swift and Sure: Hardness-aware Contrastive Learning for Low-dimensional Knowledge Graph Embeddings. (arXiv:2201.00565v2 [cs.LG] UPDATED)
    Knowledge graph embedding (KGE) has shown great potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training cost and large storage space, thus limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of Contrastive Learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of the traditional Negative Sampling, we design a new loss function based on query sampling that can balance two important training targets, Alignment and Uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism. The experimental results show that in the limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly-used datasets. After training just a few minutes, the HaLE-trained models are competitive compared to the state-of-the-art models in both low- and high-dimensional conditions.
    OstrichRL: A Musculoskeletal Ostrich Simulation to Study Bio-mechanical Locomotion. (arXiv:2112.06061v2 [cs.RO] UPDATED)
    Muscle-actuated control is a research topic that spans multiple domains, including biomechanics, neuroscience, reinforcement learning, robotics, and graphics. This type of control is particularly challenging as bodies are often overactuated and dynamics are delayed and non-linear. It is however a very well tested and tuned actuation mechanism that has undergone millions of years of evolution with interesting properties exploiting passive forces and efficient energy storage of muscle-tendon units. To facilitate research on muscle-actuated simulation, we release a 3D musculoskeletal simulation of an ostrich based on the MuJoCo physics engine. The ostrich is one of the fastest bipeds on earth and therefore makes an excellent model for studying muscle-actuated bipedal locomotion. The model is based on CT scans and dissections used to collect actual muscle data, such as insertion sites, lengths, and pennation angles. Along with this model, we also provide a set of reinforcement learning tasks, including reference motion tracking, running, and neck control, used to infer muscle actuation patterns. The reference motion data is based on motion capture clips of various behaviors that we preprocessed and adapted to our model. This paper describes how the model was built and iteratively improved using the tasks. We also evaluate the accuracy of the muscle actuation patterns by comparing them to experimentally collected electromyographic data from locomoting birds. The results demonstrate the need for rich reward signals or regularization techniques to constrain muscle excitations and produce realistic movements. Overall, we believe that this work can provide a useful bridge between fields of research interested in muscle actuation.
    A Review of Indoor Millimeter Wave Device-based Localization and Device-free Sensing Technologies and Applications. (arXiv:2112.05593v2 [cs.NI] UPDATED)
    The commercial availability of low-cost millimeter wave (mmWave) communication and radar devices is starting to improve the penetration of such technologies in consumer markets, paving the way for large-scale and dense deployments in fifth-generation (5G)-and-beyond as well as 6G networks. At the same time, pervasive mmWave access will enable device localization and device-free sensing with unprecedented accuracy, especially with respect to sub-6 GHz commercial-grade devices. This paper surveys the state of the art in device-based localization and device-free sensing using mmWave communication and radar devices, with a focus on indoor deployments. We first overview key concepts about mmWave signal propagation and system design. Then, we provide a detailed account of approaches and algorithms for localization and sensing enabled by mmWaves. We consider several dimensions in our analysis, including the main objectives, techniques, and performance of each work, whether each research reached some degree of implementation, and which hardware platforms were used for this purpose. We conclude by discussing that better algorithms for consumer-grade devices, data fusion methods for dense deployments, as well as an educated application of machine learning methods are promising, relevant and timely research directions.
    Spatio-Temporal Modeling for Flash Memory Channels Using Conditional Generative Nets. (arXiv:2111.10039v2 [eess.SY] UPDATED)
    We propose a data-driven approach to modeling the spatio-temporal characteristics of NAND flash memory read voltages using conditional generative networks. The learned model reconstructs read voltages from an individual memory cell based on the program levels of the cell and its surrounding cells, as well as the specified program/erase (P/E) cycling time stamp. We evaluate the model over a range of time stamps using the cell read voltage distributions, the cell level error rates, and the relative frequency of errors for patterns most susceptible to inter-cell interference (ICI) effects. We conclude that the model accurately captures the spatial and temporal features of the flash memory channel.
    Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization. (arXiv:2110.10832v3 [cs.LG] UPDATED)
    In Domain Generalization (DG) settings, models trained independently on a given set of training domains have notoriously chaotic performance on distribution shifted test domains, and stochasticity in optimization (e.g. seed) plays a big role. This makes deep learning models unreliable in real world settings. We first show that this chaotic behavior exists even along the training optimization trajectory of a single model, and propose a simple model averaging protocol that both significantly boosts domain generalization and diminishes the impact of stochasticity by improving the rank correlation between the in-domain validation accuracy and out-domain test accuracy, which is crucial for reliable early stopping. Taking advantage of our observation, we show that instead of ensembling unaveraged models (as is typical in practice), ensembling moving average models (EoA) from independent runs further boosts performance. We theoretically explain the boost in performance of ensembling and model averaging by adapting the well-known Bias-Variance trade-off to the domain generalization setting. On the DomainBed benchmark, when using a pre-trained ResNet-50, this ensemble of averages achieves an average of $68.0\%$, beating vanilla ERM (w/o averaging/ensembling) by $\sim 4\%$, and when using a pre-trained RegNetY-16GF, achieves an average of $76.6\%$, beating vanilla ERM by $6\%$. Our code is available at \url{https://github.com/salesforce/ensemble-of-averages}.
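    As a rough PyTorch sketch of the two ingredients, a uniform moving average of weights along one run and an ensemble over the averaged models of independent runs, with hypothetical helper names (not the authors' released code):
    ```python
    import torch

    @torch.no_grad()
    def update_moving_average(avg_model, model, step):
        # uniform running average of parameters along a single training trajectory
        for p_avg, p in zip(avg_model.parameters(), model.parameters()):
            p_avg.mul_(step / (step + 1)).add_(p, alpha=1.0 / (step + 1))

    @torch.no_grad()
    def ensemble_of_averages(avg_models, x):
        # EoA: average the softmax outputs of moving-average models taken
        # from independently trained runs
        probs = torch.stack([m(x).softmax(dim=-1) for m in avg_models])
        return probs.mean(dim=0)
    ```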
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v6 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures and topologies, and even outperforms supervised methods.
    Joint Embedding of Structural and Functional Brain Networks with Graph Neural Networks for Mental Illness Diagnosis. (arXiv:2107.03220v2 [q-bio.NC] UPDATED)
    Multimodal brain networks characterize complex connectivities among different brain regions from both structural and functional aspects and provide a new means for mental disease analysis. Recently, Graph Neural Networks (GNNs) have become a de facto model for analyzing graph-structured data. However, how to employ GNNs to extract effective representations from brain networks in multiple modalities remains rarely explored. Moreover, as brain networks provide no initial node features, how to design informative node attributes and leverage edge weights for GNNs to learn is left unsolved. To this end, we develop a novel multiview GNN for multimodal brain networks. In particular, we regard each modality as a view for brain networks and employ contrastive learning for multimodal fusion. Then, we propose a GNN model which takes advantage of the message passing scheme by propagating messages based on degree statistics and brain region connectivities. Extensive experiments on two real-world disease datasets (HIV and Bipolar) demonstrate the effectiveness of our proposed method over state-of-the-art baselines.
    Nonnegative Tensor Completion via Integer Optimization. (arXiv:2111.04580v2 [cs.LG] UPDATED)
    Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.
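    The Frank-Wolfe variant reduces each step to a linear minimization over the feasible set, which is where the paper's integer-programming separation over the 0-1 polytope enters; a generic sketch of the classic iteration, with the oracle left abstract:
    ```python
    import numpy as np

    def frank_wolfe(grad_f, linear_min_oracle, x0, n_iters=100):
        # classic Frank-Wolfe: move toward the minimizer of the linearized
        # objective; in the paper's setting the oracle would be implemented
        # with integer linear programming over a 0-1 polytope
        x = np.array(x0, dtype=float)
        for k in range(n_iters):
            s = linear_min_oracle(grad_f(x))   # argmin_{s in C} <grad_f(x), s>
            step = 2.0 / (k + 2)
            x = (1 - step) * x + step * s
        return x
    ```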
    Learning Deep Representation with Energy-Based Self-Expressiveness for Subspace Clustering. (arXiv:2110.15037v2 [cs.LG] UPDATED)
    Deep subspace clustering has attracted increasing attention in recent years. Almost all existing works are required to load the whole training data into one batch for learning the self-expressive coefficients in the framework of deep learning. Although these methods achieve promising results, such a learning fashion severely prevents the use of deeper neural network architectures (e.g., ResNet), limiting the representation abilities of the models. In this paper, we propose a new deep subspace clustering framework, motivated by energy-based models. In contrast to previous approaches that take the weights of a fully connected layer as the self-expressive coefficients, we propose to learn an energy-based network to obtain the self-expressive coefficients by mini-batch training. By this means, it is no longer necessary to load all data into one batch for learning, making it possible to utilize deeper neural network models for subspace clustering. Considering the powerful representation ability of the recently popular self-supervised learning, we attempt to leverage self-supervised representation learning to learn the dictionary. Finally, we propose a joint framework to learn both the self-expressive coefficients and the dictionary simultaneously, and train the model in an end-to-end manner. The experiments are performed on three publicly available datasets, and extensive experimental results demonstrate that our method can significantly outperform the other related approaches. For instance, on the three datasets, our method achieves average improvements of $13.8\%$, $15.4\%$, and $20.8\%$ in terms of Accuracy, NMI, and ARI over SENet, which was proposed very recently and obtains the second-best results in the experiments.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v5 [cs.LG] UPDATED)
    The use of pessimism, when reasoning about datasets lacking exhaustive exploration, has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    Approximated Multi-Agent Fitted Q Iteration. (arXiv:2104.09343v4 [cs.LG] UPDATED)
    We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximated multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, learned Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents, whereas the analogous number of computations increases exponentially for the fitted Q iteration (FQI), a commonly used approach in batch reinforcement learning. This property of AMAFQI is fundamental for the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. The simulations illustrate the significant reduction in computation time when using AMAFQI instead of FQI in multi-agent problems and corroborate the similar performance of the two approaches.
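    For context, a compact sketch of the single-agent fitted Q iteration that AMAFQI approximates, in the tree-based style of classic FQI (the regressor and hyperparameters here are assumptions, not the paper's exact setup):
    ```python
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.99):
        # transitions: list of (state, action, reward, next_state) tuples with
        # 1-D state feature vectors and integer actions
        s, a, r, s2 = (np.array(v) for v in zip(*transitions))
        X = np.hstack([s, a[:, None].astype(float)])       # regress on (s, a)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, r)  # Q_1 approximates r
        for _ in range(n_iters - 1):
            # Bellman backup: target = r + gamma * max_b Q(s', b)
            q_next = np.max([q.predict(np.hstack([s2, np.full((len(s2), 1), b)]))
                             for b in range(n_actions)], axis=0)
            q = ExtraTreesRegressor(n_estimators=50).fit(X, r + gamma * q_next)
        return q
    ```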
    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. (arXiv:2110.07205v3 [eess.AS] UPDATED)
    Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
    Differentiable Architecture Search for Reinforcement Learning. (arXiv:2106.02229v3 [cs.LG] UPDATED)
    In this paper, we investigate the fundamental question: To what extent are gradient-based neural architecture search (NAS) techniques applicable to RL? Using the original DARTS as a convenient baseline, we discover that the discrete architectures found can achieve up to 250% of the performance of manual architecture designs on both discrete and continuous action space environments across off-policy and on-policy RL algorithms, at only 3x more computation time. Furthermore, through numerous ablation studies, we systematically verify that not only does DARTS correctly upweight operations during its supernet phase, but it also gradually improves the resulting discrete cells up to 30x more efficiently than random search, suggesting DARTS is surprisingly an effective tool for improving architectures in RL.
    A multi-stage machine learning model on diagnosis of esophageal manometry. (arXiv:2106.13869v2 [cs.LG] UPDATED)
    High-resolution manometry (HRM) is the primary procedure used to diagnose esophageal motility disorders. Its interpretation and classification include an initial evaluation of swallow-level outcomes and then derivation of a study-level diagnosis based on the Chicago Classification (CC), using a tree-like algorithm. This diagnostic approach to motility disorders using HRM was mirrored using a multi-stage modeling framework developed with a combination of various machine learning approaches. Specifically, the framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage. In the swallow-level stage, three models based on convolutional neural networks (CNNs) were developed to predict swallow type, swallow pressurization, and integrated relaxation pressure (IRP). At the study-level stage, model selection from families of expert-knowledge-based rule models, xgboost models, and artificial neural network (ANN) models was conducted, with the latter two model families designed and augmented with motivation from the expert knowledge. A simple model-agnostic strategy of model balancing motivated by Bayesian principles was utilized, which gave rise to model averaging weighted by precision scores. The averaged (blended) models and individual models were compared and evaluated; the best performance on the test dataset was 0.81 for top-1 prediction and 0.92 for top-2 prediction. This is the first artificial-intelligence-style model to automatically predict the CC diagnosis of an HRM study from raw multi-swallow data. Moreover, the proposed modeling framework can be easily extended to multi-modal tasks, such as diagnosis of esophageal patients based on clinical data from both HRM and functional luminal imaging probe panometry (FLIP).
    Learning Time-Varying Graphs from Online Data. (arXiv:2110.11017v2 [cs.LG] UPDATED)
    This work proposes an algorithmic framework to learn time-varying graphs from online data. The generality offered by the framework renders it model-independent, i.e., it can be theoretically analyzed in its abstract formulation and then instantiated under a variety of model-dependent graph learning problems. This is possible by phrasing (time-varying) graph learning as a composite optimization problem, where different functions regulate different desiderata, e.g., data fidelity, sparsity, or smoothness. Instrumental to these findings is the recognition that most (if not all) data-driven graph learning algorithms depend on the data only through the empirical covariance matrix, which represents a sufficient statistic for the estimation problem. Its user-defined recursive update enables the framework to work in non-stationary environments, while iterative algorithms building on novel time-varying optimization tools explicitly take the temporal dynamics into account, speeding up convergence and implicitly including a temporal regularization of the solution. We specialize the framework to three well-known graph learning models, namely the Gaussian graphical model (GGM), the structural equation model (SEM), and the smoothness-based model (SBM), for which we also introduce ad-hoc vectorization schemes for structured matrices (symmetric, hollow, etc.) that are crucial for performing correct gradient computations, besides enabling work in low-dimensional vector spaces and hence easing storage requirements. After discussing the theoretical guarantees of the proposed framework, we corroborate it with extensive numerical tests on synthetic and real data.
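    One common instantiation of the user-defined recursive covariance update, an exponentially weighted scheme that discounts stale data in non-stationary environments (the discount factor is a user choice, not prescribed by the framework):
    ```python
    import numpy as np

    def update_covariance(C, x, gamma=0.95):
        # exponentially weighted recursive update of the empirical covariance:
        # C_t = gamma * C_{t-1} + (1 - gamma) * x x^T, where gamma in (0, 1)
        # discounts old samples so the estimate tracks a time-varying graph
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        return gamma * C + (1 - gamma) * (x @ x.T)
    ```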
    Generalizing to Unseen Domains: A Survey on Domain Generalization. (arXiv:2103.03097v7 [cs.LG] UPDATED)
    Machine learning systems generally assume that the training and testing distributions are the same. To this end, a key requirement is to develop models that can generalize to unseen distributions. Domain generalization (DG), i.e., out-of-distribution generalization, has attracted increasing interest in recent years. Domain generalization deals with a challenging setting where one or several different but related domain(s) are given, and the goal is to learn a model that can generalize to an unseen test domain. Great progress has been made in this area over the years. This paper presents the first review of recent advances in this area. First, we provide a formal definition of domain generalization and discuss several related fields. We then thoroughly review the theories related to domain generalization and carefully analyze the theory behind generalization. Next, we categorize recent algorithms into three classes: data manipulation, representation learning, and learning strategy, and present several popular algorithms in detail for each category. Third, we introduce the commonly used datasets, applications, and our open-sourced codebase for fair evaluation. Finally, we summarize the existing literature and present some potential research topics for the future.
    Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results. (arXiv:2107.04982v2 [cs.LG] UPDATED)
    We study the problem of out-of-distribution dynamics (OODD) detection, which involves detecting when the dynamics of a temporal process change compared to the training-distribution dynamics. This is relevant to applications in control, reinforcement learning (RL), and multi-variate time-series, where changes to test time dynamics can impact the performance of learning controllers/predictors in unknown ways. This problem is particularly important in the context of deep RL, where learned controllers often overfit to the training environment. Currently, however, there is a lack of established OODD benchmarks for the types of environments commonly used in RL research. Our first contribution is to design a set of OODD benchmarks derived from common RL environments with varying types and intensities of OODD. Our second contribution is to design a strong OODD baseline approach based on recurrent implicit quantile network (RIQN), which monitors autoregressive prediction errors for OODD detection. In addition to RIQN, we introduce and test three other simpler baselines. Our final contribution is to evaluate our baseline approaches on the benchmarks to provide results for future comparison.
    Towards Continual Knowledge Learning of Language Models. (arXiv:2110.03215v4 [cs.CL] UPDATED)
    Large Language Models (LMs) are known to encode world knowledge in their parameters as they pretrain on a vast amount of web corpus, which is often utilized for performing knowledge-dependent downstream tasks such as question answering, fact-checking, and open dialogue. In real-world scenarios, the world knowledge stored in the LMs can quickly become outdated as the world changes, but it is non-trivial to avoid catastrophic forgetting and reliably acquire new knowledge while preserving invariant knowledge. To push the community towards better maintenance of ever-changing LMs, we formulate a new continual learning (CL) problem called Continual Knowledge Learning (CKL). We construct a new benchmark and metric to quantify the retention of time-invariant world knowledge, the update of outdated knowledge, and the acquisition of new knowledge. We adopt applicable recent methods from literature to create several strong baselines. Through extensive experiments, we find that CKL exhibits unique challenges that are not addressed in previous CL setups, where parameter expansion is necessary to reliably retain and learn knowledge simultaneously. By highlighting the critical causes of knowledge forgetting, we show that CKL is a challenging and important problem that helps us better understand and train ever-changing LMs. The benchmark datasets, evaluation script, and baseline code to reproduce our results are available at https://github.com/joeljang/continual-knowledge-learning.
    Recent Advances on Neural Network Pruning at Initialization. (arXiv:2103.06460v3 [cs.LG] UPDATED)
    Neural network pruning typically removes connections or neurons from a pretrained converged model, while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning paradigm. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared with pruning after training (PaT) and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.
    Quasi-Equivalence of Width and Depth of Neural Networks. (arXiv:2002.02515v7 [cs.LG] UPDATED)
    While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network respectively for either regression or classification so that the essentially same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
    How much human-like visual experience do current self-supervised learning algorithms need in order to achieve human-level object recognition?. (arXiv:2109.11523v3 [cs.CV] UPDATED)
    This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like" natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is several orders of magnitude longer than a human lifetime: typically on the order of a million to a billion years of natural visual experience (depending on the algorithm used). We obtain even larger estimates for achieving human-level performance in ImageNet-derived robustness benchmarks. The exact values of these estimates are sensitive to some underlying assumptions, however even in the most optimistic scenarios they remain orders of magnitude larger than a human lifetime. We discuss the main caveats surrounding our estimates and the implications of these surprising results.
    Towards A Measure Of General Machine Intelligence. (arXiv:2109.12075v4 [cs.AI] UPDATED)
    To build general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark called the generalization index, or the g-index, to measure and compare the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
    3D Infomax improves GNNs for Molecular Property Prediction. (arXiv:2110.04126v3 [cs.LG] UPDATED)
    Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impacts. Including 3D molecular structure as input to learned models improves their performance for many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a Graph Neural Network (GNN) such that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still generates implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.
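    Mutual information maximization between 3D summary vectors and GNN representations is typically approximated with a contrastive bound; a minimal InfoNCE-style sketch (the paper's exact objective may differ in detail):
    ```python
    import torch
    import torch.nn.functional as F

    def info_nce(z2d, z3d, tau=0.1):
        # z2d: GNN embeddings of 2D molecular graphs; z3d: 3D summary vectors,
        # both (N, d). Matched pairs share an index; all other pairs serve as
        # negatives. Minimizing this loss maximizes a lower bound on their MI.
        z2d, z3d = F.normalize(z2d, dim=1), F.normalize(z3d, dim=1)
        logits = z2d @ z3d.t() / tau
        targets = torch.arange(z2d.size(0), device=z2d.device)
        return F.cross_entropy(logits, targets)
    ```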
    Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey. (arXiv:2101.00734v2 [stat.ML] UPDATED)
    This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and the Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Owing to their stochastic and generative behaviour, these models can also be used to generate new data points in the data space. In this paper, we first start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, the VAE is explained, where the encoder, decoder, and sampling from the latent space are introduced. Training the VAE using both EM and backpropagation is explained.
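    As a pointer to the material covered, a minimal sketch of the VAE training objective derived in such tutorials, the negative ELBO with a standard normal prior plus the reparameterization trick:
    ```python
    import torch
    import torch.nn.functional as F

    def negative_elbo(x, x_recon, mu, log_var):
        # -ELBO = reconstruction term - KL(q(z|x) || N(0, I)), as a loss
        recon = F.mse_loss(x_recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl

    def reparameterize(mu, log_var):
        # z = mu + sigma * eps keeps sampling differentiable in the encoder outputs
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    ```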
    FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective. (arXiv:2110.03061v4 [cs.LG] UPDATED)
    Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners since various applications prefer different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements of FL training. FedTune is lightweight and flexible, achieving 8.48%-26.75% improvement for different datasets compared to fixed FL hyper-parameters.
    Weakly-supervised Multi-output Regression via Correlated Gaussian Processes. (arXiv:2002.08412v2 [stat.ML] UPDATED)
    Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.
    Effects of Image Size on Deep Learning. (arXiv:2101.11508v3 [cs.CV] UPDATED)
    This paper presents an evaluation of the effects of image size on deep learning performance via semantic segmentation of magnetic resonance heart images with U-net for fully automated quantification of myocardial infarction. Both non-extra-pixel and extra-pixel interpolation algorithms are used to change the size of images in the datasets of interest. Extra class labels in interpolated ground truth segmentation images are removed using thresholding, median filtering, and subtraction strategies. Common class metrics are used to evaluate the quality of the U-net semantic segmentation against the ground truth segmentation, while an arbitrary threshold, comparison of the sums, and sums of differences between medical experts' and fully automated results are used to estimate the relationship between expert-based quantification and fully automated quantification results.
    MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. (arXiv:2007.10731v2 [stat.CO] UPDATED)
    A novel multi-task Gaussian process (GP) framework is proposed, using a common mean process to share information across tasks. In particular, we investigate the problem of time series forecasting, with the objective of improving multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore, an EM algorithm is derived for handling both hyper-parameter optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves predictive performance, even far from observations, and may significantly reduce the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performance, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
    RevUp: Revise and Update Information Bottleneck for Event Representation. (arXiv:2205.12248v1 [cs.LG])
    In machine learning, latent variables play a key role in capturing the underlying structure of data, but they are often unsupervised. When we have side knowledge that already carries high-level information about the input data, we can use that source to guide the latent variables and capture the available background information in a process called "parameter injection." In that regard, we propose a semi-supervised information bottleneck-based model that enables the use of side knowledge, even if it is noisy and imperfect, to direct the learning of discrete latent variables. Fundamentally, we introduce an auxiliary continuous latent variable as a way to reparameterize the model's discrete variables with a light-weight hierarchical structure. With this reparameterization, the model's discrete latent variables are learned to minimize the mutual information between the observed data and optional side knowledge that is not already captured by the new, auxiliary variables. We theoretically show that our approach generalizes an existing method of parameter injection, and perform an empirical case study of our approach on language-based event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previously proposed approaches on multiple datasets.
    DeepKriging: Spatially Dependent Deep Neural Networks for Spatial Prediction. (arXiv:2007.11972v4 [stat.ML] UPDATED)
    In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions and is often associated with Gaussian processes. However, when considering non-linear prediction for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where the spatial dependence is captured by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and it has multiple advantages over Kriging for non-Gaussian and non-stationary data, i.e., it provides non-linear predictions and thus has smaller approximation errors, it does not require operations on covariance matrices and thus is scalable for large datasets, and with sufficiently many hidden neurons, it provides the optimal prediction in terms of model capacity. We further explore the possibility of quantifying prediction uncertainties based on density prediction without assuming any data distribution. Finally, we apply the method to predicting PM2.5 concentrations across the continental United States.
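    A simplified single-resolution illustration of the embedding idea, using Gaussian radial basis functions over a set of knots (a stand-in; the paper's own multi-resolution basis functions may differ):
    ```python
    import numpy as np

    def rbf_embedding(coords, centers, bandwidth=1.0):
        # coords: (n, 2) observation locations; centers: (k, 2) basis knots.
        # The resulting (n, k) features are concatenated with covariates and
        # fed to an ordinary DNN, letting it capture spatial dependence.
        d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    ```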
    Risk-Sensitive Reinforcement Learning via Policy Gradient Search. (arXiv:1810.09126v3 [cs.LG] UPDATED)
    The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, as well as outlining some potential future research directions.
    DoorGym: A Scalable Door Opening Environment And Baseline Agent. (arXiv:1908.01887v4 [cs.RO] UPDATED)
    In order to practically implement the door opening task, a policy ought to be robust to a wide distribution of door types and environment settings. Reinforcement Learning (RL) with Domain Randomization (DR) is a promising technique for enforcing policy generalization; however, there are only a few accessible training environments that are inherently designed to train agents in domain randomized environments. We introduce DoorGym, an open-source door opening simulation framework designed to utilize domain randomization to train a stable policy. We intend for our environment to lie at the intersection of domain transfer, practical tasks, and realism. We also provide baseline Proximal Policy Optimization and Soft Actor-Critic implementations, which achieve success rates ranging from 0% up to 95% for opening various types of doors in this environment. Moreover, the real-world transfer experiment shows that the trained policy is able to work in the real world. Environment kit available here: https://github.com/PSVL/DoorGym/
    Modular Meta-Learning for Power Control via Random Edge Graph Neural Networks. (arXiv:2108.13178v2 [cs.NI] CROSS LISTED)
    In this paper, we consider the problem of power control for a wireless network with an arbitrarily time-varying topology, including the possible addition or removal of nodes. A data-driven design methodology that leverages graph neural networks (GNNs) is adopted in order to efficiently parametrize the power control policy mapping the channel state information (CSI) to transmit powers. The specific GNN architecture, known as random edge GNN (REGNN), defines a non-linear graph convolutional filter whose spatial weights are tied to the channel coefficients. While prior work assumed a joint training approach whereby the REGNN-based policy is shared across all topologies, this paper targets adaptation of the power control policy based on limited CSI data regarding the current topology. To this end, we propose a novel modular meta-learning technique that enables the efficient optimization of module assignment. While black-box meta-learning optimizes a general-purpose adaptation procedure via (stochastic) gradient descent, modular meta-learning finds a set of reusable modules that can form components of a solution for any new network topology. Numerical results validate the benefits of meta-learning for power control problems over joint training schemes, and demonstrate the advantages of modular meta-learning when data availability is extremely limited.
    EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling. (arXiv:2205.12243v1 [stat.ML])
    This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.
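    The MCMC sampler underlying all three regimes is typically Langevin dynamics; a minimal PyTorch sketch, where how x is initialized (noise, data, or a persistent bank) and the number of steps determine the short-, mid-, or long-run behavior (hyperparameter values are illustrative only):
    ```python
    import torch

    def langevin_sample(energy_net, x, n_steps=100, step_size=1e-2, noise=5e-3):
        # MCMC negative sampling for ML learning of an EBM: descend the energy
        # gradient with injected Gaussian noise; energy_net returns per-sample
        # energies
        x = x.clone().requires_grad_(True)
        for _ in range(n_steps):
            energy = energy_net(x).sum()
            grad, = torch.autograd.grad(energy, x)
            x = (x - step_size * grad + noise * torch.randn_like(x))
            x = x.detach().requires_grad_(True)
        return x.detach()
    ```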
    Taming the sign problem of explicitly antisymmetrized neural networks via rough activation functions. (arXiv:2205.12250v1 [cs.LG])
    Explicit antisymmetrization of a two-layer neural network is a potential candidate for a universal function approximator for generic antisymmetric functions, which are ubiquitous in quantum physics. However, this strategy suffers from a sign problem, namely, due to near exact cancellation of positive and negative contributions, the magnitude of the antisymmetrized function may be significantly smaller than that before antisymmetrization. We prove that the severity of the sign problem is directly related to the smoothness of the activation function. For smooth activation functions (e.g., $\tanh$), the sign problem of the explicitly antisymmetrized two-layer neural network deteriorates super-polynomially with respect to the system size. On the other hand, for rough activation functions (e.g., ReLU), the deterioration rate of the sign problem can be tamed to be at most polynomial with respect to the system size. Finally, the cost of a direct implementation of antisymmetrized two-layer neural network scales factorially with respect to the system size. We describe an efficient algorithm for approximate evaluation of such a network, of which the cost scales polynomially with respect to the system size and inverse precision.
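    The object of study can be written down directly; a brute-force sketch of explicit antisymmetrization, whose factorial cost and signed near-cancellations are exactly what the paper analyzes:
    ```python
    from itertools import permutations

    def permutation_sign(perm):
        # parity via inversion count: +1 for even permutations, -1 for odd
        inversions = sum(p > q for i, p in enumerate(perm) for q in perm[i + 1:])
        return 1 if inversions % 2 == 0 else -1

    def antisymmetrize(f, x):
        # x: array of shape (n, d) holding n particle coordinates; f is the
        # (two-layer) network. Cost grows as n!, and the signed sum can be far
        # smaller in magnitude than any single term: the sign problem.
        n = len(x)
        return sum(permutation_sign(p) * f(x[list(p)])
                   for p in permutations(range(n)))
    ```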
    Interpretation Quality Score for Measuring the Quality of interpretability methods. (arXiv:2205.12254v1 [cs.CL])
    Machine learning (ML) models have been applied to a wide range of natural language processing (NLP) tasks in recent years. In addition to making accurate decisions, the necessity of understanding how models make their decisions has become apparent in many applications. To that end, many interpretability methods that help explain the decision processes of ML models have been developed. Yet, there currently exists no widely accepted metric to evaluate the quality of explanations generated by these methods. As a result, there is no standard way of measuring to what degree an interpretability method achieves an intended objective, and no accepted standard of performance by which we can compare and rank the existing interpretability methods. In this paper, we propose a novel metric for quantifying the quality of explanations generated by interpretability methods. We compute the metric on three NLP tasks using six interpretability methods and present our results.
    Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Query Attacks. (arXiv:2205.12134v1 [cs.LG])
    Score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, using only the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs can be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., there is no degradation in clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure $80.59\%$ accuracy under Square attack ($2500$ queries), while the best prior defense (i.e., adversarial training) only attains $67.44\%$. Since AAA attacks the general greedy strategy of SQAs, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets and bounds. Moreover, AAA calibrates better without hurting accuracy. Our code will be released.
    Asynchronous Neural Networks for Learning in Graphs. (arXiv:2205.12245v1 [cs.LG])
    This paper studies asynchronous message passing (AMP), a new paradigm for applying neural network based learning to graphs. Existing graph neural networks use the synchronous distributed computing model and aggregate their neighbors in each round, which causes problems such as oversmoothing and limits their expressiveness. On the other hand, AMP is based on the asynchronous model, where nodes react to messages of their neighbors individually. We prove that (i) AMP can simulate synchronous GNNs and that (ii) AMP can theoretically distinguish any pair of graphs. We experimentally validate AMP's expressiveness. Further, we show that AMP might be better suited to propagate messages over large distances in graphs and performs well on several graph classification benchmarks.
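    A schematic of the asynchronous computing model that AMP builds on (this illustrates the execution model only, not the paper's learned architecture; names are hypothetical):
    ```python
    from collections import deque

    def async_message_passing(adj, node_update, init_messages, max_events=10_000):
        # adj: dict node -> list of neighbors; node_update(node, msg) returns
        # the message to forward, or None to stop; init_messages seeds the
        # queue with (target_node, message) pairs. Nodes react to messages one
        # at a time: there is no global synchronous aggregation round.
        queue = deque(init_messages)
        for _ in range(max_events):
            if not queue:
                break
            node, msg = queue.popleft()
            out = node_update(node, msg)
            if out is not None:
                for neigh in adj[node]:
                    queue.append((neigh, out))
    ```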
    One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks. (arXiv:2205.12141v1 [cs.LG])
    Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Error-minimizing noise, which is injected into clean data, is one of the most successful methods for preventing DNNs from giving correct predictions on incoming new data. Nonetheless, under specific training strategies such as adversarial training, the unlearnability of error-minimizing noise severely degrades. In addition, the transferability of error-minimizing noise is inherently limited by the mismatch between the generator model and the targeted learner model. In this paper, we investigate the mechanism of unlearnable examples and propose a novel model-free method, named \emph{One-Pixel Shortcut}, which perturbs only a single pixel of each image and makes the dataset unlearnable. Our method requires much less computation and obtains stronger transferability, and thus can protect data from a wide range of different models. Based on this, we further introduce the first unlearnable dataset, called CIFAR-10-S, which is indistinguishable from normal CIFAR-10 by human observers and can serve as a benchmark for different models or training strategies to evaluate their ability to extract critical features from the disturbance of non-semantic representations. The original error-minimizing ULEs lose efficiency under adversarial training, where the model can reach over 83\% clean test accuracy. Meanwhile, even if adversarial training and strong data augmentation like RandAugment are applied together, the model trained on CIFAR-10-S cannot exceed 50\% clean test accuracy.
    Gacs-Korner Common Information Variational Autoencoder. (arXiv:2205.12239v1 [cs.LG])
    We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is a variational relaxation of the G\'acs-K\"orner common information, which we recover as a special case, but is more amenable to optimization and can be approximated empirically using samples from the underlying distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables. Additionally, we show that the auto-encoder that we learn recovers semantically meaningful disentangled factors of variation, even though we do not explicitly optimize for it.
    Federated singular value decomposition for high dimensional data. (arXiv:2205.12109v1 [cs.LG])
    Federated learning (FL) is emerging as a privacy-aware alternative to classical cloud-based machine learning. In FL, the sensitive data remains in data silos and only aggregated parameters are exchanged. Hospitals and research institutions which are not willing to share their data can join a federated study without breaching confidentiality. In addition to the extreme sensitivity of biomedical data, the high dimensionality poses a challenge in the context of federated genome-wide association studies (GWAS). In this article, we present a federated singular value decomposition (SVD) algorithm suitable for the privacy-related and computational requirements of GWAS. Notably, the algorithm has a transmission cost independent of the number of samples and is only weakly dependent on the number of features, because the singular vectors associated with the samples are never exchanged and the vectors associated with the features are exchanged only for a fixed number of iterations. Although motivated by GWAS, the algorithm is generically applicable to both horizontally and vertically partitioned data.
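    A toy sketch of the communication pattern, federated power iteration for the leading right singular vector, in which only feature-space quantities ever leave a silo (this illustrates the transmission-cost argument, not the paper's full algorithm):
    ```python
    import numpy as np

    def federated_top_singular_vector(silos, n_iters=50):
        # silos: list of local (samples x features) matrices X_i, never shared.
        # Only the feature-space vector v and the aggregated products travel;
        # the sample-space vectors X_i @ v stay local, so transmission cost is
        # independent of the number of samples.
        d = silos[0].shape[1]
        v = np.random.randn(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            # each silo computes X_i^T (X_i v) locally; the server sums them
            agg = sum(X.T @ (X @ v) for X in silos)
            v = agg / np.linalg.norm(agg)
        return v
    ```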
    Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization. (arXiv:2205.12191v1 [cs.CL])
    Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, we observe that these models exhibit poor out-of-distribution (OOD) generalization on the task of VQA. To better understand the underlying causes of poor generalization, we comprehensively investigate performance of two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also argue that in most cases generative models are less susceptible to shifts in data distribution, while frequently performing better on our tested benchmarks. Moreover, we find that multimodal pretraining improves OOD performance in most settings. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
    Individual Topology Structure of Eye Movement Trajectories. (arXiv:2205.10667v2 [cs.CV] UPDATED)
    Traditionally, extracting patterns from eye movement data relies on statistics of different macro-events such as fixations and saccades. This requires an additional preprocessing step to separate the eye movement subtypes, often with a number of parameters on which the classification results depend. Besides that, definitions of such macro-events are formulated in different ways by different researchers. We propose the application of a new class of features to the quantitative analysis of the structure of personal eye movement trajectories. This new class of features, based on algebraic topology, allows patterns to be extracted from different modalities of gaze, such as time series of coordinates and amplitudes, heatmaps, and point clouds, in a unified way at all scales from micro to macro. We experimentally demonstrate the competitiveness of the new class of features with traditional ones, and their significant synergy when used together, for the person authentication task on a recently published eye movement trajectories dataset.
    D$^\text{2}$UF: Deep Coded Aperture Design and Unrolling Algorithm for Compressive Spectral Image Fusion. (arXiv:2205.12158v1 [eess.IV])
    Compressive spectral imaging (CSI) has attracted significant attention since it employs coded apertures to encode spatial and spectral information, sensing only 2D projections of the 3D spectral image. However, these optical architectures suffer from a trade-off between the spatial and spectral resolution of the reconstructed image due to technology limitations. To overcome this issue, compressive spectral image fusion (CSIF) employs the projected measurements of two CSI architectures with different resolutions to estimate a high-spatial, high-spectral resolution image. This work presents the fusion of the compressive measurements of a low-spatial, high-spectral resolution coded aperture snapshot spectral imager (CASSI) architecture and a high-spatial, low-spectral resolution multispectral color filter array (MCFA) system. Unlike previous CSIF works, this paper proposes joint optimization of the sensing architectures and a reconstruction network in an end-to-end (E2E) manner. The trainable optical parameters are the coded aperture (CA) in the CASSI and the colored coded aperture in the MCFA system; a sigmoid activation function and a regularization function are employed to encourage binary values of the trainable variables for implementation purposes. Additionally, an unrolling-based network inspired by the alternating direction method of multipliers (ADMM) optimization is formulated to address the reconstruction step and the acquisition system design jointly. Finally, a spatial-spectral inspired loss function is employed at the end of each unrolling layer to increase the convergence of the unrolling network. The proposed method outperforms previous CSIF methods, and experimental results validate the method with real measurements.
    Deeper vs Wider: A Revisit of Transformer Configuration. (arXiv:2205.10505v2 [cs.LG] UPDATED)
    Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted. For example, we often set the base model with hidden dimensions (i.e. model width) to be 768 and the number of transformer layers (i.e. model depth) to be 12. In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, an idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models like MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default setting by 1.1 points on average on GLUE datasets.
    Bias Discovery in Machine Learning Models for Mental Health. (arXiv:2205.12093v1 [cs.LG])
    Fairness and bias are crucial concepts in artificial intelligence, yet they are relatively ignored in machine learning applications in clinical psychiatry. We computed fairness metrics and present bias mitigation strategies using a model trained on clinical mental health data. We collected structured data related to the admission, diagnosis, and treatment of patients in the psychiatry department of the University Medical Center Utrecht. We trained a machine learning model to predict future administrations of benzodiazepines on the basis of past data. We found that gender plays an unexpected role in the predictions; this constitutes bias. Using the AI Fairness 360 package, we implemented reweighing and discrimination-aware regularization as bias mitigation strategies, and we explored their implications for model performance. This is the first application of bias exploration and mitigation in a machine learning model trained on real clinical psychiatry data.
    Inference of a Rumor's Source in the Independent Cascade Model. (arXiv:2205.12125v1 [cs.SI])
    We consider the so-called Independent Cascade Model for rumor spreading or epidemic processes popularized by Kempe et al. [2003]. In this model, a small subset of nodes from a network are the source of a rumor. In discrete time steps, each informed node "infects" each of its uninformed neighbors with probability $p$. While many facets of this process are studied in the literature, less is known about the inference problem: given a number of infected nodes in a network, can we learn the source of the rumor? In the context of epidemiology this problem is often referred to as the patient zero problem. It belongs to a broader class of problems where the goal is to infer parameters of the underlying spreading model, see, e.g., Lokhov [NeurIPS'16] or Mastakouri et al. [NeurIPS'20]. In this work we present a maximum likelihood estimator for the rumor's source, given a snapshot of the process in terms of a set of active nodes $X$ after $t$ steps. Our results show that, for cycle-free graphs, the likelihood estimator undergoes a non-trivial phase transition as a function of $t$. We provide a rigorous analysis for two prominent classes of acyclic networks, namely $d$-regular trees and Galton-Watson trees, and verify empirically that our heuristics work well in various general networks.
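    For intuition, the likelihood estimator can be approximated by brute-force simulation (the paper analyzes it analytically on trees; the graph, `p`, and `t` below are illustrative assumptions):

    ```python
    import random

    def simulate_cascade(adj, source, p, t):
        """One run of the Independent Cascade Model for t steps; returns the active set."""
        active, frontier = {source}, {source}
        for _ in range(t):
            # Each newly informed node u gets one infection attempt per uninformed neighbor v.
            frontier = {v for u in frontier for v in adj[u]
                        if v not in active and random.random() < p}
            active |= frontier
        return active

    def mle_source(adj, observed, p, t, runs=20000):
        """Monte Carlo estimate of P(X = observed | source); return the maximizer."""
        def likelihood(s):
            return sum(simulate_cascade(adj, s, p, t) == observed
                       for _ in range(runs)) / runs
        return max(observed, key=likelihood)  # the source must itself be active

    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}       # a path graph (cycle-free)
    print(mle_source(adj, observed={1, 2, 3}, p=0.5, t=1))  # -> 2
    ```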
    Phased Progressive Learning with Coupling-Regulation-Imbalance Loss for Imbalanced Classification. (arXiv:2205.12117v1 [cs.LG])
    Deep neural networks generally perform poorly with datasets that suffer from quantity imbalance and classification-difficulty imbalance between different classes. In order to alleviate the problem of dataset bias or domain shift in existing two-stage approaches, a phased progressive learning schedule was proposed for smoothly transferring the training emphasis from representation learning to upper classifier training. This is particularly effective on datasets with more severe imbalances or smaller scales. A coupling-regulation-imbalance loss function was designed, coupling a correction term, the focal loss, and the LDAM loss. The coupling-regulation-imbalance loss can better deal with quantity imbalance and outliers, while regulating the focus of attention across samples of varying classification difficulty. Excellent results were achieved on multiple benchmark datasets using these approaches, and they can easily be generalized to other imbalanced classification models. Our code will be open-sourced soon.
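    One plausible way to couple an LDAM-style class margin with focal down-weighting is sketched below (PyTorch). This is an illustrative combination only: the paper's correction term and exact coefficients are not reproduced, and `C`, `gamma`, and the class counts are assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def ldam_focal_loss(logits, targets, class_counts, C=0.5, gamma=2.0):
        """Sketch: LDAM margin delta_j = C / n_j^{1/4} plus focal factor (1 - p_t)^gamma."""
        margins = C / class_counts.float() ** 0.25     # larger margins for rare classes
        idx = torch.arange(len(targets))
        adjusted = logits.clone()
        adjusted[idx, targets] -= margins[targets]     # subtract margin from true-class logit
        log_p = F.log_softmax(adjusted, dim=1)[idx, targets]
        focal = (1.0 - log_p.exp()) ** gamma           # down-weight easy samples
        return -(focal * log_p).mean()

    # Toy usage: 3 classes with hypothetical counts 1000 / 100 / 10.
    loss = ldam_focal_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)),
                           torch.tensor([1000, 100, 10]))
    ```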
    Not too little, not too much: a theoretical analysis of graph (over)smoothing. (arXiv:2205.12156v1 [stat.ML])
    We analyze graph smoothing with \emph{mean aggregation}, where each node successively receives the average of the features of its neighbors. Indeed, it has quickly been observed that Graph Neural Networks (GNNs), which generally follow some variant of Message-Passing (MP) with repeated aggregation, may be subject to the \emph{oversmoothing} phenomenon: by performing too many rounds of MP, the node features tend to converge to a non-informative limit. In the case of mean aggregation, for connected graphs, the node features become constant across the whole graph. At the other end of the spectrum, it is intuitively obvious that \emph{some} MP rounds are necessary, but existing analyses do not exhibit both phenomena at once: beneficial ``finite'' smoothing and oversmoothing in the limit. In this paper, we consider simplified linear GNNs, and rigorously analyze two examples for which a finite number of mean aggregation steps provably improves the learning performance, before oversmoothing kicks in. We consider a latent space random graph model, where node features are partial observations of the latent variables and the graph contains pairwise relationships between them. We show that graph smoothing restores some of the lost information, up to a certain point, via two phenomena: graph smoothing shrinks non-principal directions in the data faster than principal ones, which is useful for regression, and shrinks nodes within communities faster than they collapse together, which improves classification.
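    The oversmoothing limit for mean aggregation is easy to reproduce numerically; this small sketch (on an arbitrary random graph, an illustrative choice) shows node features collapsing toward a constant as the number of aggregation rounds grows:

    ```python
    import numpy as np

    # Mean aggregation: x <- D^{-1} A x (with self-loops). On a connected graph,
    # repeated rounds drive the feature vector toward a constant across nodes.
    rng = np.random.default_rng(0)
    A = (rng.random((50, 50)) < 0.1).astype(float)
    A = np.maximum(A, A.T) + np.eye(50)           # symmetrize, add self-loops
    W = A / A.sum(axis=1, keepdims=True)          # row-stochastic aggregation operator

    x = rng.normal(size=50)
    for k in range(60):
        if k in (0, 2, 10, 50):
            print(f"round {k:2d}: feature std = {x.std():.4f}")  # std -> 0 (constant limit)
        x = W @ x
    ```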
    EXPANSE: A Deep Continual / Progressive Learning System for Deep Transfer Learning. (arXiv:2205.10356v2 [cs.LG] UPDATED)
    Deep transfer learning (DTL) techniques try to tackle the limitations of deep learning, the dependency on extensive training data and the training costs, by reusing obtained knowledge. However, current DTL techniques suffer from either the catastrophic forgetting dilemma (losing previously obtained knowledge) or overly biased pre-trained models (harder to adapt to target data) when fine-tuning pre-trained models or freezing a part of the pre-trained model, respectively. Progressive learning, a sub-category of DTL, reduces the effect of the overly biased model in the case of freezing earlier layers by adding a new layer to the end of a frozen pre-trained model. Even though it has been successful in many cases, it cannot yet handle distant source and target data. We propose a new continual/progressive learning approach for deep transfer learning to tackle these limitations. To avoid both the catastrophic-forgetting and overly-biased-model problems, we widen the pre-trained layers (adding new nodes to each layer) instead of only adding new layers. Hence the method is named EXPANSE. Our experimental results confirm that we can tackle distant source and target data using this technique. At the same time, the final model remains valid on the source data, achieving a promising deep continual learning approach. Moreover, we offer a new way of training deep learning models inspired by the human education system. We termed this two-step training: learning basics first, then adding complexities and uncertainties. The evaluation implies that two-step training extracts more meaningful features and a finer basin on the error surface, since it achieves better accuracy than regular training. EXPANSE (model expansion and two-step training) is a systematic continual learning approach applicable to different problems and DL models.
    SepIt Approaching a Single Channel Speech Separation Bound. (arXiv:2205.11801v1 [eess.AS])
    We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the estimation of the different speakers. At test time, SepIt uses a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v2 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that helps users compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and takes users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides improves efficiency and reduces the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.
    Deep Reinforcement Learning for Multi-class Imbalanced Training. (arXiv:2205.12070v1 [cs.LG])
    With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training extremely imbalanced data sets, and extend it for use in multi-class settings. We combine dueling and double deep Q-learning architectures, and formulate a custom reward function and episode-training procedure, specifically with the added capability of handling multi-class imbalanced training. Using real-world clinical case studies, we demonstrate that our proposed framework outperforms current state-of-the-art imbalanced learning methods, achieving fairer and more balanced classification, while also significantly improving the prediction of minority classes.
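    The flavor of such reward shaping can be sketched as follows (a common scheme in reward-based imbalanced classification, not the paper's exact reward function; the class frequencies are hypothetical):

    ```python
    def imbalanced_reward(pred: int, label: int, class_freq: dict) -> float:
        """Reward for one classification 'action' in an episode.

        Correct predictions earn a positive reward and mistakes a negative one,
        both scaled inversely with class frequency, so minority-class errors
        cost the agent more than majority-class ones.
        """
        scale = 1.0 / class_freq[label]   # rarer class -> larger magnitude
        return scale if pred == label else -scale

    # Hypothetical 3-class clinical setting with one rare event class.
    freq = {0: 0.90, 1: 0.08, 2: 0.02}
    print(imbalanced_reward(pred=0, label=0, class_freq=freq))  # ~ +1.1
    print(imbalanced_reward(pred=0, label=2, class_freq=freq))  # -50.0: a costly miss
    ```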
    RuMedBench: A Russian Medical Language Understanding Benchmark. (arXiv:2201.06499v2 [cs.CL] UPDATED)
    The paper describes the open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, named entity recognition) on a number of novel text sets. Given the sensitive nature of data in healthcare, such a benchmark partially addresses the scarcity of Russian medical datasets. We prepare unified labeling formats, data splits, and evaluation metrics for the new tasks. The remaining tasks are from existing datasets with a few modifications. A single-number metric expresses a model's ability to cope with the benchmark. Moreover, we implement several baseline models, from simple ones to neural networks with the transformer architecture, and release the code. Expectedly, the more advanced models yield better performance, but even a simple model is enough for a decent result in some tasks. Furthermore, for all tasks, we provide a human evaluation. Interestingly, the models outperform humans in the large-scale classification tasks. However, the advantage of natural intelligence remains in the tasks requiring more knowledge and reasoning.
    KQGC: Knowledge Graph Embedding with Smoothing Effects of Graph Convolutions for Recommendation. (arXiv:2205.12102v1 [cs.IR])
    Leveraging graphs on recommender systems has gained popularity with the development of graph representation learning (GRL). In particular, knowledge graph embedding (KGE) and graph neural networks (GNNs) are representative GRL approaches, which have achieved state-of-the-art performance on several recommendation tasks. Furthermore, the combination of KGE and GNNs (KG-GNNs) has been explored and found effective in the academic literature. One of the main characteristics of GNNs is their ability to retain structural properties among neighbors in the resulting dense representation, which is usually coined as smoothing. Smoothing is especially desirable in the presence of homophilic graphs, such as the ones we find in recommender systems. In this paper, we propose a new model for recommender systems named Knowledge Query-based Graph Convolution (KQGC). In contrast to existing KG-GNNs, KQGC focuses on smoothing, and leverages a simple linear graph convolution for smoothing KGE. A pre-trained KGE is fed into KQGC and smoothed by aggregating neighbor knowledge queries, which allows entity embeddings to be aligned at appropriate vector points for smoothing the KGE effectively. We apply the proposed KQGC to a recommendation task that aims to identify prospective users for specific products. Extensive experiments on a real e-commerce dataset demonstrate the effectiveness of KQGC.
    DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning. (arXiv:2204.08499v2 [cs.LG] UPDATED)
    Coreset selection, which aims to select a subset of the most informative training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, active learning, etc. However, many existing coreset selection methods are not designed for deep learning, which may have high complexity and poor generalization performance. In addition, the recently proposed methods are evaluated on models, datasets, and settings of different complexities. To advance the research of coreset selection in deep learning, we contribute a comprehensive code library, namely DeepCore, and provide an empirical study on popular coreset selection methods on CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet datasets verify that, although various methods have advantages in certain experiment settings, random selection is still a strong baseline.
    Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study). (arXiv:2205.00664v2 [cs.LG] UPDATED)
    Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labeling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the next versions of the system. Feng et al. propose DeepGini, a very fast and simple TIP, and show that it outperforms more elaborate techniques such as neuron and surprise coverage. In a large-scale study (4 case studies, 8 test datasets, 32'200 trained models) we verify their findings. However, we also find that other comparable or even simpler baselines from the field of uncertainty quantification, such as the predicted softmax likelihood or the entropy of the predicted softmax likelihoods, perform just as well as DeepGini.
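    Both DeepGini and the entropy baseline are one-liners over the model's softmax output, which is much of their appeal; a minimal sketch:

    ```python
    import numpy as np

    def deepgini(probs):
        """DeepGini impurity: 1 - sum_i p_i^2 (higher = more uncertain input)."""
        return 1.0 - np.sum(probs ** 2, axis=1)

    def entropy(probs, eps=1e-12):
        """Predictive entropy of the softmax output, the simpler baseline."""
        return -np.sum(probs * np.log(probs + eps), axis=1)

    # Prioritize test inputs by descending uncertainty (most informative first).
    probs = np.array([[0.98, 0.01, 0.01],   # confident  -> low priority
                      [0.40, 0.35, 0.25]])  # uncertain  -> high priority
    print(np.argsort(-deepgini(probs)))     # [1 0]
    print(deepgini(probs), entropy(probs))
    ```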
    Selecting Continuous Life-Like Cellular Automata for Halting Unpredictability: Evolving for Abiogenesis. (arXiv:2204.07541v2 [cs.NE] UPDATED)
    Substantial efforts have been applied to engineer cellular automata (CA) with desired emergent properties, such as supporting gliders. Recent work in continuous CA has generated a wide variety of compelling bioreminiscent patterns, and the expansion of CA research into continuously-valued domains, multiple channels, and higher dimensions complicates their study. In this work we devise a strategy for evolving CA and CA patterns in two steps, based on the simple idea that CA are likely to be complex and computationally capable if they support both patterns that grow indefinitely and patterns that vanish completely, and it is difficult to predict in advance which of the two a given pattern will do. The second part of our strategy evolves patterns by selecting for mobility and conservation of mean cell value. We validate our pattern evolution method by re-discovering gliders in 17 of 17 Lenia CA, and also report 4 new evolved CA and 1 randomly evolved CA that support novel evolved glider patterns. The CA reported here share neighborhood kernels with previously described Lenia CA, but exhibit a wider range of typical dynamics than their Lenia counterparts. Code for evolving continuous CA is made available under an MIT License (https://github.com/rivesunder/yuca).
    Learning Stabilizing Policies in Stochastic Control Systems. (arXiv:2205.11991v1 [cs.LG])
    In this work, we address the problem of learning provably stable neural network policies for stochastic control systems. While recent work has demonstrated the feasibility of certifying given policies using martingale theory, the problem of how to learn such policies is little explored. Here, we study the effectiveness of jointly learning a policy together with a martingale certificate that proves its stability using a single learning algorithm. We observe that the joint optimization problem becomes easily stuck in local minima when starting from a randomly initialized policy. Our results suggest that some form of pre-training of the policy is required for the joint optimization to repair and verify the policy successfully.
    Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations. (arXiv:2204.02683v2 [cs.LG] UPDATED)
    Contrastive learning is a highly effective method for learning representations from unlabeled data. Recent works show that contrastive representations can transfer across domains, leading to simple state-of-the-art algorithms for unsupervised domain adaptation. In particular, a linear classifier trained to separate the representations on the source domain can also predict classes on the target domain accurately, even though the representations of the two domains are far from each other. We refer to this phenomenon as linear transferability. This paper analyzes when and why contrastive representations exhibit linear transferability in a general unsupervised domain adaptation setting. We prove that linear transferability can occur when data from the same class in different domains (e.g., photo dogs and cartoon dogs) are more related to each other than data from different classes in different domains (e.g., photo dogs and cartoon cats) are. Our analyses are in a realistic regime where the source and target domains can have unbounded density ratios and be weakly related, and they have distant representations across domains.
    Highly Accurate FMRI ADHD Classification using time distributed multi modal 3D CNNs. (arXiv:2205.11993v1 [cs.LG])
    This work proposes an algorithm for fMRI data analysis for the classification of ADHD disorders. There have been several breakthroughs in the analysis of fMRI via 3D convolutional neural networks (CNNs). With these new techniques it is possible to preserve the 3D spatial structure of fMRI data. Additionally, there have been recent advances in the use of 3D generative adversarial networks (GANs) for the generation of normal MRI data. This work utilizes multi-modal 3D CNNs with data augmentation from a 3D GAN for ADHD prediction from fMRI. By leveraging a 3D GAN, it would be possible to use deepfake data to enhance the accuracy of 3D CNN classification of brain disorders. A comparison is made between a time-distributed single-modal 3D CNN model for classification and the modified multi-modal model that additionally uses MRI data.
    PatchNR: Learning from Small Data by Patch Normalizing Flow Regularization. (arXiv:2205.12021v1 [cs.LG])
    Learning neural networks using only a small amount of data is an important ongoing research topic with tremendous potential for applications. In this paper, we introduce a regularizer for the variational modeling of inverse problems in imaging based on normalizing flows. Our regularizer, called patchNR, involves a normalizing flow learned on patches of very few images. The subsequent reconstruction method is completely unsupervised and the same regularizer can be used for different forward operators acting on the same class of images. By investigating the distribution of patches versus those of the whole image class, we prove that our variational model is indeed a MAP approach. Our model can be generalized to conditional patchNRs, if additional supervised information is available. Numerical examples for low-dose CT, limited-angle CT and superresolution of material images demonstrate that our method provides high-quality results among unsupervised methods, while requiring only little data.
    Improving Human Image Synthesis with Residual Fast Fourier Transformation and Wasserstein Distance. (arXiv:2205.12022v1 [cs.CV])
    With the rapid development of the Metaverse, virtual humans have emerged, and human image synthesis and editing techniques, such as pose transfer, have recently become popular. Most of the existing techniques rely on GANs, which can generate good human images even with large variants and occlusions. But to the best of our knowledge, the existing state-of-the-art method still has the following two problems: first, the rendering of the synthesized image is not realistic, with some regions rendered poorly; second, GAN training is unstable and slow to converge, e.g., suffering from mode collapse. We propose several methods to solve these two problems. To improve the rendering effect, we use the Residual Fast Fourier Transform Block to replace the traditional Residual Block. Then, spectral normalization and the Wasserstein distance are used to improve the speed and stability of GAN training. Experiments demonstrate that the methods we offer are effective at solving the problems listed above, and we achieve state-of-the-art scores in LPIPS and PSNR.
    FedEntropy: Efficient Device Grouping for Federated Learning Using Maximum Entropy Judgment. (arXiv:2205.12038v1 [cs.LG])
    Along with the popularity of Artificial Intelligence (AI) and Internet-of-Things (IoT), Federated Learning (FL) has attracted steadily increasing attention as a promising distributed machine learning paradigm, which enables the training of a central model across numerous decentralized devices without exposing their privacy. However, due to the biased data distributions on the involved devices, FL inherently suffers from low classification accuracy in non-IID scenarios. Although various device grouping methods have been proposed to address this problem, most of them neglect both i) the distinct data distribution characteristics of heterogeneous devices, and ii) the contributions and hazards of local models, which are extremely important in determining the quality of global model aggregation. In this paper, we present an effective FL method named FedEntropy with a novel dynamic device grouping scheme, which makes full use of the above two factors based on our proposed maximum entropy judgment heuristic. Unlike existing FL methods that directly aggregate local models returned from all the selected devices, in one FL round FedEntropy first makes a judgment based on the pre-collected soft labels of selected devices and then only aggregates the local models that can maximize the overall entropy of these soft labels. By not collecting local models that are harmful to aggregation, FedEntropy can effectively improve global model accuracy while reducing the overall communication overhead. Comprehensive experimental results on well-known benchmarks show that FedEntropy not only outperforms state-of-the-art FL methods in terms of model accuracy and communication overhead, but can also be integrated into them to enhance their classification performance.
    Naive Few-Shot Learning: Sequence Consistency Evaluation. (arXiv:2205.12013v1 [cs.AI])
    Cognitive psychologists often use the term $\textit{fluid intelligence}$ to describe the ability of humans to solve novel tasks without any prior training. In contrast to humans, deep neural networks can perform cognitive tasks only after extensive (pre-)training with a large number of relevant examples. Motivated by fluid intelligence research in the cognitive sciences, we built a benchmark task which we call sequence consistency evaluation (SCE) that can be used to address this gap. Solving the SCE task requires the ability to extract simple rules from sequences, a basic computation that is required for solving various intelligence tests in humans. We tested $\textit{untrained}$ (naive) deep learning models in the SCE task. Specifically, we compared Relation Networks (RN) and Contrastive Predictive Coding (CPC), two models that can extract simple rules from sequences, and found that the latter, which imposes a structure on the predicted rule, does better. We further found that simple networks fare better in this task than complex ones. Finally, we show that this approach can be used for security camera anomaly detection without any prior training.
    Theoretical Analysis of Primal-Dual Algorithm for Non-Convex Stochastic Decentralized Optimization. (arXiv:2205.11979v1 [math.OC])
    In recent years, decentralized learning has emerged as a powerful tool not only for large-scale machine learning, but also for preserving privacy. One of the key challenges in decentralized learning is that the data distribution held by each node is statistically heterogeneous. To address this challenge, the primal-dual algorithm called the Edge-Consensus Learning (ECL) was proposed and was experimentally shown to be robust to the heterogeneity of data distributions. However, the convergence rate of the ECL is provided only when the objective function is convex, and has not been shown in a standard machine learning setting where the objective function is non-convex. Furthermore, the intuitive reason why the ECL is robust to the heterogeneity of data distributions has not been investigated. In this work, we first investigate the relationship between the ECL and Gossip algorithm and show that the update formulas of the ECL can be regarded as correcting the local stochastic gradient in the Gossip algorithm. Then, we propose the Generalized ECL (G-ECL), which contains the ECL as a special case, and provide the convergence rates of the G-ECL in both (strongly) convex and non-convex settings, which do not depend on the heterogeneity of data distributions. Through synthetic experiments, we demonstrate that the numerical results of both the G-ECL and ECL coincide with the convergence rate of the G-ECL.
    Can Adversarial Training Be Manipulated By Non-Robust Features?. (arXiv:2201.13329v2 [cs.LG] UPDATED)
    Adversarial training, originally designed to resist test-time adversarial examples, has been shown to be promising in mitigating training-time availability attacks. This defense ability, however, is challenged in this paper. We identify a novel threat model named stability attacks, which aims to hinder robust availability by slightly manipulating the training data. Under this threat, we show that adversarial training using a conventional defense budget $\epsilon$ provably fails to provide test robustness in a simple statistical setting, where the non-robust features of the training data can be reinforced by $\epsilon$-bounded perturbation. Further, we analyze the necessity of enlarging the defense budget to counter stability attacks. Finally, comprehensive experiments demonstrate that stability attacks are harmful on benchmark datasets, and thus adaptive defenses are necessary to maintain robustness.
    3D helical CT reconstruction with memory efficient invertible Learned Primal-Dual method. (arXiv:2205.11952v1 [eess.IV])
    Helical acquisition geometry is the most common geometry used in computed tomography (CT) scanners for medical imaging. We adapt the invertible Learned Primal-Dual (iLPD) deep neural network architecture so that it can be applied to helical 3D CT reconstruction. We achieve this by splitting the geometry and the data in parts that fit the memory and by splitting images into corresponding sub-volumes. The architecture can be applied to images different in size along the rotation axis. We perform the experiments on tomographic data simulated from realistic helical geometries.
    Realization Theory Of Recurrent Neural ODEs Using Polynomial System Embeddings. (arXiv:2205.11989v1 [math.OC])
    In this paper we show that neural ODE analogs of recurrent (ODE-RNN) and Long Short-Term Memory (ODE-LSTM) networks can be algorithmically embedded into the class of polynomial systems. This embedding preserves input-output behavior and can suitably be extended to other neural DE architectures. We then use realization theory of polynomial systems to provide necessary conditions for an input-output map to be realizable by an ODE-LSTM and sufficient conditions for minimality of such systems. These results represent the first steps towards a realization theory of recurrent neural ODE architectures, which is expected to be useful for model reduction and learning algorithm analysis of recurrent neural ODEs.
    How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory. (arXiv:2205.11930v1 [cs.CL])
    Human ratings are treated as the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and then rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. In this work, we analyze this standard protocol through the lens of utility theory in economics. We first identify the implicit assumptions it makes about annotators and find that these assumptions are often violated in practice, in which case annotator ratings become an unfaithful reflection of their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). In our experiments, we find that according to SPA, annotators prefer larger GPT-3 variants to smaller ones -- as expected -- with all comparisons being statistically significant. In contrast, the standard protocol only yields significant results half the time.
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v1 [stat.ML])
    Most machine learning methods depend on the tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation, which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection method based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
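    The paper's closed-form Jacobian-control rule is not reproduced here, but the trade-off the abstract describes (smoothness of the fitted function versus conditioning of the kernel matrix) is easy to see numerically with a simple bandwidth scan (the data, noise level, and ridge parameter below are illustrative assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(40, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=40)
    lam = 1e-3  # ridge parameter

    def gauss_kernel(A, B, sigma):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    grid = np.linspace(-1, 1, 200)[:, None]
    for sigma in (0.01, 0.1, 1.0, 10.0):
        K = gauss_kernel(X, X, sigma)
        alpha = np.linalg.solve(K + lam * np.eye(40), y)   # KRR coefficients
        f = gauss_kernel(grid, X, sigma) @ alpha           # fitted function on a grid
        grad = np.abs(np.gradient(f, grid[:, 0])).mean()   # smoothness proxy
        cond = np.linalg.cond(K + lam * np.eye(40))        # conditioning proxy
        print(f"sigma={sigma:5.2f}  mean|f'|={grad:8.3f}  cond={cond:10.1e}")
    ```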
    Assessing the Quality of Computational Notebooks for a Frictionless Transition from Exploration to Production. (arXiv:2205.11941v1 [cs.SE])
    The massive trend of integrating data-driven AI capabilities into traditional software systems is raising intriguing new challenges. One such challenge is achieving a smooth transition from the explorative phase of Machine Learning projects - in which data scientists build prototypical models in the lab - to their production phase - in which software engineers translate prototypes into production-ready AI components. To narrow the gap between these two phases, the tools and practices adopted by data scientists might be improved by incorporating consolidated software engineering solutions. In particular, computational notebooks have a prominent role in determining the quality of data science prototypes. In my research project, I address this challenge by studying the best practices for collaboration with computational notebooks and proposing proof-of-concept tools to foster guidelines compliance.
    Estimation of Convex Polytopes for Automatic Discovery of Charge State Transitions in Quantum Dot Arrays. (arXiv:2108.09133v2 [cs.LG] UPDATED)
    In spin-based quantum dot arrays, material or fabrication imprecisions affect the behaviour of the device, which must be taken into account when controlling it. This requires measuring the shape of specific convex polytopes. In this work, we present an algorithm that automatically discovers the count, shape, and size of the facets of a convex polytope from measurements. Results on simulated devices as well as a real 2x2 spin qubit array show that we can reliably find the facets of the convex polytopes, including small facets with sizes on the order of the measurement precision.
    A Data-Centric Optimization Framework for Machine Learning. (arXiv:2110.10802v2 [cs.LG] UPDATED)
    Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize performance optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization. The pipeline begins with standard networks in PyTorch or ONNX and transforms computation through progressive lowering. We define four levels of general-purpose transformations, from local intra-operator optimizations to global data movement reduction. These operate on a data-centric graph intermediate representation that expresses computation and data movement at all levels of abstraction, including expanding basic operators such as convolutions to their underlying computations. Central to the design is the interactive and introspectable nature of the pipeline. Every part is extensible through a Python API, and can be tuned interactively using a GUI. We demonstrate competitive performance or speedups on ten different networks, with interactive optimizations discovering new opportunities in EfficientNet.
    MS-nowcasting: Operational Precipitation Nowcasting with Convolutional LSTMs at Microsoft Weather. (arXiv:2111.09954v2 [cs.LG] UPDATED)
    We present the encoder-forecaster convolutional long short-term memory (LSTM) deep-learning model that powers Microsoft Weather's operational precipitation nowcasting product. This model takes as input a sequence of weather radar mosaics and deterministically predicts future radar reflectivity at lead times up to 6 hours. By stacking a large input receptive field along the feature dimension and conditioning the model's forecaster with predictions from the physics-based High Resolution Rapid Refresh (HRRR) model, we are able to outperform optical flow and HRRR baselines by 20-25% on multiple metrics averaged over all lead times.
    Predicting Physics in Mesh-reduced Space with Temporal Attention. (arXiv:2201.09113v3 [cs.LG] UPDATED)
    Graph-based next-step prediction models have recently been very successful in modeling complex high-dimensional physical systems on irregular meshes. However, due to their short temporal attention span, these models suffer from error accumulation and drift. In this paper, we propose a new method that captures long-term dependencies through a transformer-style temporal attention model. We introduce an encoder-decoder structure to summarize features and create a compact mesh representation of the system state, to allow the temporal model to operate on low-dimensional mesh representations in a memory-efficient manner. Our method outperforms a competitive GNN baseline on several complex fluid dynamics prediction tasks, from sonic shocks to vascular flow. We demonstrate stable rollouts without the need for training noise and show perfectly phase-stable predictions even for very long sequences. More broadly, we believe our approach paves the way to bringing the benefits of attention-based sequence models to solving high-dimensional complex physics tasks.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v3 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
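    The role of posterior predictive distributions in decision making is easiest to see in Thompson sampling, where actions are driven by posterior samples rather than marginal point predictions. A classic Bernoulli-bandit sketch (exact Beta posteriors here; the paper's approximate Thompson sampling algorithm itself is not reproduced):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    true_means = [0.3, 0.5, 0.7]     # hidden reward probabilities per arm
    wins = np.ones(3)                # Beta(1, 1) priors on each arm's mean
    losses = np.ones(3)

    for t in range(5000):
        theta = rng.beta(wins, losses)        # one posterior sample per arm
        arm = int(np.argmax(theta))           # act greedily on the *sample*
        reward = rng.random() < true_means[arm]
        wins[arm] += reward
        losses[arm] += 1 - reward

    print(wins / (wins + losses))             # posterior means; the best arm dominates play
    ```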
    Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs. (arXiv:2205.11894v1 [cs.LG])
    We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. We introduce a new model that decomposes independent dynamics of single objects accurately from their interactions. By employing latent Gaussian process ordinary differential equations, our model infers both independent dynamics and their interactions with reliable uncertainty estimates. In our formulation, each object is represented as a graph node and interactions are modeled by accumulating the messages coming from neighboring objects. We show that efficient inference of such a complex network of variables is possible with modern variational sparse Gaussian process inference techniques. We empirically demonstrate that our model improves the reliability of long-term predictions over neural network based alternatives and it successfully handles missing dynamic or static information. Furthermore, we observe that only our model can successfully encapsulate independent dynamics and interaction information in distinct functions and show the benefit from this disentanglement in extrapolation scenarios.
    Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision. (arXiv:2205.11913v1 [cs.DC])
    Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators are collectively assembled into GPU datacenters. An efficient scheduler design for such a GPU datacenter is crucially important to reduce operational costs and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads cannot support DL workloads in fully utilizing the GPU resources. Recently, a substantial number of schedulers tailored to DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource consumption features. Finally, we discuss several promising future research directions. A more detailed summary of the surveyed papers and code links can be found at our project website: https://github.com/S-Lab-SystemGroup/Awesome-DL-Scheduling-Papers
    Large Language Models are Zero-Shot Reasoners. (arXiv:2205.11916v1 [cs.CL])
    Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding ``Let's think step by step'' before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with an off-the-shelf 175B parameter model. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted through simple prompting. We hope our work not only serves as the minimal yet strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
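    The method itself is essentially prompt plumbing: a reasoning-extraction call followed by an answer-extraction call. A sketch of this two-stage prompting, with `query_llm` standing in for whatever completion API is available (a hypothetical placeholder, not a real client):

    ```python
    def zero_shot_cot(question: str, query_llm) -> str:
        # Stage 1: elicit the reasoning chain with the trigger phrase.
        reasoning = query_llm(f"Q: {question}\nA: Let's think step by step.")
        # Stage 2: feed the reasoning back and extract the final answer.
        answer = query_llm(
            f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
            "Therefore, the answer (arabic numerals) is"
        )
        return answer.strip()
    ```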
    Physics-Embedded Neural Networks: $\boldsymbol{\mathrm{E}(n)}$-Equivariant Graph Neural PDE Solvers. (arXiv:2205.11912v1 [cs.LG])
    Graph neural network (GNN) is a promising approach to learning and predicting physical phenomena described in boundary value problems, such as partial differential equations (PDEs) with boundary conditions. However, existing models inadequately treat boundary conditions essential for the reliable prediction of such problems. In addition, because of the locally connected nature of GNNs, it is difficult to accurately predict the state after a long time, where interaction between vertices tends to be global. We present our approach termed physics-embedded neural networks that considers boundary conditions and predicts the state after a long time using an implicit method. It is built based on an $\mathrm{E}(n)$-equivariant GNN, resulting in high generalization performance on various shapes. We demonstrate that our model learns flow phenomena in complex shapes and outperforms a well-optimized classical solver and a state-of-the-art machine learning model in speed-accuracy trade-off. Therefore, our model can be a useful standard for realizing reliable, fast, and accurate GNN-based PDE solvers.
    The Data-Production Dispositif. (arXiv:2205.11963v1 [cs.HC])
    Machine learning (ML) depends on data to train and verify models. Very often, organizations outsource processes related to data work (i.e., generating and annotating data and evaluating outputs) through business process outsourcing (BPO) companies and crowdsourcing platforms. This paper investigates outsourced ML data work in Latin America by studying three platforms in Venezuela and a BPO in Argentina. We lean on the Foucauldian notion of dispositif to define the data-production dispositif as an ensemble of discourses, actions, and objects strategically disposed to (re)produce power/knowledge relations in data and labor. Our dispositif analysis comprises the examination of 210 data work instruction documents, 55 interviews with data workers, managers, and requesters, and participant observation. Our findings show that discourses encoded in instructions reproduce and normalize the worldviews of requesters. Precarious working conditions and economic dependency alienate workers, making them obedient to instructions. Furthermore, discourses and social contexts materialize in artifacts, such as interfaces and performance metrics, limiting workers' agency and normalizing specific ways of interpreting data. We conclude by stressing the importance of counteracting the data-production dispositif by fighting alienation and precarization, and empowering data workers to become assets in the quest for high-quality data.
    Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning. (arXiv:2205.11854v1 [cs.LG])
    Recently, deploying deep neural network (DNN) models via collaborative inference, which splits a pre-trained model into two parts and executes them on user equipment (UE) and an edge server respectively, has become attractive. However, the large intermediate features of DNNs impede flexible decoupling, and existing approaches either focus on the single-UE scenario or simply define tasks in terms of the required CPU cycles, ignoring the indivisibility of a single DNN layer. In this paper, we study the multi-agent collaborative inference scenario, where a single edge server coordinates the inference of multiple UEs. Our goal is to achieve fast and energy-efficient inference for all UEs. To achieve this goal, we first design a lightweight autoencoder-based method to compress the large intermediate feature. Then we define tasks according to the inference overhead of DNNs and formulate the problem as a Markov decision process (MDP). Finally, we propose a multi-agent hybrid proximal policy optimization (MAHPPO) algorithm to solve the optimization problem with a hybrid action space. We conduct extensive experiments with different types of networks, and the results show that our method can reduce inference latency by up to 56\% and energy consumption by up to 72\%.
    Causal Influences Decouple From Their Underlying Network Structure In Echo State Networks. (arXiv:2205.11947v1 [cs.LG])
    Echo State Networks (ESN) are versatile recurrent neural network models in which the hidden layer remains unaltered during training. Interactions among nodes of this static backbone produce diverse representations of the given stimuli that are harnessed by a read-out mechanism to perform computations needed for solving a given task. ESNs are accessible models of neuronal circuits, since they are relatively inexpensive to train. Therefore, ESNs have become attractive for neuroscientists studying the relationship between neural structure, function, and behavior. For instance, it is not yet clear how distinctive connectivity patterns of brain networks support effective interactions among their nodes and how these patterns of interactions give rise to computation. To address this question, we employed an ESN with a biologically inspired structure and used a systematic multi-site lesioning framework to quantify the causal contribution of each node to the network's output, thus providing a causal link between network structure and behavior. We then focused on the structure-function relationship and decomposed the causal influence of each node on all other nodes, using the same lesioning framework. We found that nodes in a properly engineered ESN interact largely irrespective of the network's underlying structure. However, in a network with the same topology and a non-optimal parameter set, the underlying connectivity patterns determine the node interactions. Our results suggest that causal structure-function relations in ESNs can be decomposed into two components, direct and indirect interactions. The former are based on influences relying on structural connections. The latter describe the effective communication between any two nodes through other intermediate nodes. These widely distributed indirect interactions may crucially contribute to the efficient performance of ESNs.
    An Adaptive Contrastive Learning Model for Spike Sorting. (arXiv:2205.11914v1 [cs.LG])
    Brain-computer interfaces (BCIs) are a means for electronic devices to communicate directly with the brain. For most medical BCI tasks, the activity of multiple units of neurons or local field potentials is sufficient for decoding. But for BCIs used in neuroscience research, it is important to separate out the activity of individual neurons. With the development of large-scale silicon technology and the increasing number of probe channels, manually interpreting and labeling spikes is becoming increasingly impractical. In this paper, we propose a novel modeling framework, the Adaptive Contrastive Learning Model, which learns representations from spikes through contrastive learning, with a mutual-information-maximization loss function as its theoretical basis. Building on the fact that data with similar features share the same labels whether the task is multi-class or binary classification, we simplify the multi-class classification problem into multiple binary classifications, improving both accuracy and runtime efficiency. Moreover, we introduce a series of enhancements for the spikes and address the degradation in classification performance caused by overlapping spikes.
    A Quadrature Rule combining Control Variates and Adaptive Importance Sampling. (arXiv:2205.11890v1 [stat.ML])
    Driven by several successful applications such as stochastic gradient descent and Bayesian computation, control variates have become a major tool for Monte Carlo integration. However, standard methods do not allow the distribution of the particles to evolve during the algorithm, as is the case in sequential simulation methods. Within the standard adaptive importance sampling framework, a simple weighted least squares approach is proposed to improve the procedure with control variates. The procedure takes the form of a quadrature rule with adapted quadrature weights that reflect the information brought in by the control variates. The quadrature points and weights do not depend on the integrand, a computational advantage in the case of multiple integrands. Moreover, the target density needs to be known only up to a multiplicative constant. Our main result is a non-asymptotic bound on the probabilistic error of the procedure. The bound shows that the benefits of adaptive importance sampling and control variates can be combined to improve the estimate's accuracy. The good behavior of the method is illustrated empirically on synthetic examples and real-world data for Bayesian linear regression.
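    A minimal numerical sketch of the combination, i.e. importance sampling with control variates fitted by weighted least squares (the proposal, integrand, and Hermite-polynomial controls below are illustrative choices, not the paper's experiments):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.normal(0.0, 1.5, n)                 # proposal N(0, 1.5^2)
    # Importance weights: target N(0,1) density over proposal density (constants cancel).
    w = np.exp(-0.5 * x**2) / (np.exp(-0.5 * (x / 1.5) ** 2) / 1.5)

    f = np.cos(x) + x                           # integrand; true E[f] under N(0,1) = e^{-1/2}
    H = np.stack([x, x**3 - 3 * x], axis=1)     # Hermite controls with zero mean under N(0,1)

    # Weighted least squares of f on [1, controls]; the fitted intercept is the
    # quadrature estimate, with the controls absorbed into the quadrature weights.
    D = np.column_stack([np.ones(n), H])
    beta, *_ = np.linalg.lstsq(D * np.sqrt(w)[:, None], f * np.sqrt(w), rcond=None)
    plain = np.sum(w * f) / np.sum(w)           # self-normalized IS without controls
    print(f"with controls: {beta[0]:.4f}  plain AIS: {plain:.4f}  truth: {np.exp(-0.5):.4f}")
    ```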
    Advanced Manufacturing Configuration by Sample-efficient Batch Bayesian Optimization. (arXiv:2205.11827v1 [cs.LG])
    We propose a framework for the configuration and operation of expensive-to-evaluate advanced manufacturing methods, based on Bayesian optimization. The framework unifies a tailored acquisition function, a parallel acquisition procedure, and the integration of process information providing context to the optimization procedure. The novel acquisition function is demonstrated and analyzed on benchmark illustrative problems. We apply the optimization approach to atmospheric plasma spraying in simulation and experiments. Our results demonstrate that the proposed framework can efficiently find input parameters that produce the desired outcome and minimize the process cost.
    Accelerating Frank-Wolfe via Averaging Step Directions. (arXiv:2205.11794v1 [math.OC])
    The Frank-Wolfe method is a popular method in sparse constrained optimization, due to its fast per-iteration complexity. However, the tradeoff is that its worst case global convergence is comparatively slow, and importantly, is fundamentally slower than its flow rate--that is to say, the convergence rate is throttled by discretization error. In this work, we consider a modified Frank-Wolfe where the step direction is a simple weighted average of past oracle calls. This method requires very little memory and computational overhead, and provably decays this discretization error term. Numerically, we show that this method improves the convergence rate over several problems, especially after the sparse manifold has been detected. Theoretically, we show the method has an overall global convergence rate of $O(1/k^p)$, where $0< p < 1$; after manifold identification, this rate speeds to $O(1/k^{3p/2})$. We also observe that the method achieves this accelerated rate from a very early stage, suggesting a promising mode of acceleration for this family of methods.
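    A toy version of the modification (a sketch; the paper's exact averaging weights may differ) for minimizing a quadratic over the probability simplex, where the linear minimization oracle returns vertices and the step moves toward a running average of past oracle outputs:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    b = rng.random(10); b /= b.sum()        # target in the simplex; optimum x* = b
    x = np.ones(10) / 10                    # start at the barycenter
    s_avg = np.zeros(10)

    for k in range(1, 500):
        grad = x - b                        # gradient of f(x) = 0.5 * ||x - b||^2
        s = np.zeros(10)
        s[np.argmin(grad)] = 1.0            # LMO over the simplex returns a vertex
        s_avg += (s - s_avg) / k            # running average of past oracle calls
        x += (2.0 / (k + 2)) * (s_avg - x)  # FW step toward the averaged point (stays feasible)

    print("distance to optimum:", np.linalg.norm(x - b))
    ```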
    Why KDAC? A general activation function for knowledge discovery. (arXiv:2111.13858v4 [cs.LG] UPDATED)
    Deep-learning-oriented named entity recognition (DNER) has gradually become the paradigm of knowledge discovery, greatly promoting domain intelligence. However, the activation functions currently used in DNER fail to address vanishing gradients, the absence of negative outputs, or the existence of non-differentiable points, which may impede knowledge exploration through the omission and incomplete representation of latent semantics. To break through this dilemma, we present a novel activation function termed KDAC. In detail, KDAC is an aggregation function with multiple conversion modes. The backbone of the activation region is the interaction between exponential and linear components, and both ends extend through adaptive linear divergence, which overcomes the obstacles of vanishing gradients and the absence of negative outputs. Crucially, non-differentiable points are detected and eliminated by an approximate smoothing algorithm. KDAC has a series of desirable properties, including nonlinearity, a stable near-linear transformation and derivative, and a dynamic style. We perform experiments based on the BERT-BiLSTM-CNN-CRF model on six benchmark datasets containing different domain knowledge, namely Weibo, Clinical, E-commerce, Resume, HAZOP and People's Daily. The evaluation results show that KDAC is advanced and effective, and can provide a more general activation to stimulate the performance of DNER. We hope that KDAC can be exploited as a promising activation function for knowledge discovery.
    Energy Forecasting in Smart Grid Systems: A Review of the State-of-the-art Techniques. (arXiv:2011.12598v3 [cs.LG] UPDATED)
    Energy forecasting has a vital role to play in smart grid (SG) systems involving various applications such as demand-side management, load shedding, and optimum dispatch. Managing efficient forecasting while ensuring the least possible prediction error is one of the main challenges posed in the grid today, considering the uncertainty and granularity in SG data. This paper presents a comprehensive and application-oriented review of state-of-the-art forecasting methods for SG systems along with recent developments in probabilistic deep learning (PDL) considering different models and architectures. Traditional point forecasting methods including statistical, machine learning (ML), and deep learning (DL) are extensively investigated in terms of their applicability to energy forecasting. In addition, the significance of hybrid and data pre-processing techniques to support forecasting performance is also studied. A comparative case study using the Victorian electricity consumption and American electric power (AEP) datasets is conducted to analyze the performance of point and probabilistic forecasting methods. The analysis demonstrates higher accuracy of the long-short term memory (LSTM) models with appropriate hyper-parameter tuning among point forecasting methods, especially when sample sizes are larger and involve nonlinear patterns with long sequences. Furthermore, the Bayesian bidirectional LSTM (BLSTM), as a probabilistic method, exhibits the highest accuracy, achieving the lowest pinball score and root mean square error (RMSE).
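    The pinball score used to rank the probabilistic forecasters is simple to state; a quick sketch (the data here are synthetic, not the Victorian or AEP datasets):

    ```python
    import numpy as np

    def pinball_loss(y, q_pred, tau):
        """Pinball (quantile) loss for a tau-quantile forecast q_pred:
        tau * (y - q) when the forecast under-predicts, (1 - tau) * (q - y) otherwise."""
        diff = y - q_pred
        return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

    # Sanity check: the loss is minimized when q_pred is the true tau-quantile.
    y = np.random.default_rng(0).normal(size=10000)
    for q in (-1.0, 0.0, 1.2816):   # 1.2816 ~ the 90% quantile of N(0, 1)
        print(q, pinball_loss(y, q, tau=0.9))
    ```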
    Neural Distributed Source Coding. (arXiv:2106.02797v2 [cs.IT] UPDATED)
    Distributed source coding (DSC) is the task of encoding an input in the absence of correlated side information that is only available to the decoder. Remarkably, Slepian and Wolf showed in 1973 that an encoder without access to the side information can asymptotically achieve the same compression rate as when the side information is available to it. While there is vast prior work on this topic, practical DSC has been limited to synthetic datasets and specific correlation structures. Here we present a framework for lossy DSC that is agnostic to the correlation structure and can scale to high dimensions. Rather than relying on hand-crafted source-modeling, our method utilizes a conditional VQ-VAE to learn the distributed encoder and decoder. We evaluate our method on multiple datasets and show that our method can handle complex correlations -- significantly better than the current state-of-the-art method.
    Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits. (arXiv:2002.00287v3 [cs.LG] UPDATED)
    We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm are allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d. at random from a known distribution, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best available bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results on this problem setting.
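    Both algorithms build on the classic Exp3 template: exponential weights over arms with importance-weighted loss estimates. A generic non-contextual sketch (not RealLinExp3 or RobustLinExp3 themselves; the stationary losses are a toy assumption):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    K, T, eta = 5, 10000, 0.01
    weights = np.ones(K)
    arm_means = rng.random(K)            # hidden per-arm loss means (toy stationary case)

    for t in range(T):
        p = weights / weights.sum()
        arm = rng.choice(K, p=p)
        loss = float(rng.random() < arm_means[arm])  # only the played arm's loss is observed
        est = np.zeros(K)
        est[arm] = loss / p[arm]                     # unbiased importance-weighted estimate
        weights *= np.exp(-eta * est)
        weights /= weights.max()                     # rescale for numerical stability

    print("play probabilities:", np.round(weights / weights.sum(), 3))
    print("best (lowest-loss) arm:", int(np.argmin(arm_means)))
    ```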
    Out-of-domain Detection for Natural Language Understanding in Dialog Systems. (arXiv:1909.03862v4 [cs.CL] UPDATED)
    Natural Language Understanding (NLU) is a vital component of dialogue systems, and its ability to detect Out-of-Domain (OOD) inputs is critical in practical applications, since the acceptance of an OOD input that is unsupported by the current system may lead to catastrophic failure. However, most existing OOD detection methods rely heavily on manually labeled OOD samples and cannot take full advantage of unlabeled data. This limits the feasibility of these models in practical applications. In this paper, we propose a novel model to generate high-quality pseudo OOD samples that are akin to IN-Domain (IND) input utterances, and thereby improve the performance of OOD detection. To this end, an autoencoder is trained to map an input utterance into a latent code, and the codes of IND and OOD samples are trained to be indistinguishable by utilizing a generative adversarial network. To provide more supervision signals, an auxiliary classifier is introduced to regularize the generated OOD samples to have indistinguishable intent labels. Experiments show that the pseudo OOD samples generated by our model can be used to effectively improve OOD detection in NLU. Besides, we also demonstrate that the effectiveness of these pseudo OOD data can be further improved by efficiently utilizing unlabeled data.
    CDFKD-MFS: Collaborative Data-free Knowledge Distillation via Multi-level Feature Sharing. (arXiv:2205.11845v1 [cs.CV])
    Recently, the compression and deployment of powerful deep neural networks (DNNs) on resource-limited edge devices to provide intelligent services have become attractive tasks. Although knowledge distillation (KD) is a feasible solution for compression, its requirement on the original dataset raises privacy concerns. In addition, it is common to integrate multiple pretrained models to achieve satisfactory performance. How to compress multiple models into a tiny model is challenging, especially when the original data are unavailable. To tackle this challenge, we propose a framework termed collaborative data-free knowledge distillation via multi-level feature sharing (CDFKD-MFS), which consists of a multi-header student module, an asymmetric adversarial data-free KD module, and an attention-based aggregation module. In this framework, the student model equipped with a multi-level feature-sharing structure learns from multiple teacher models and is trained together with a generator in an asymmetric adversarial manner. When some real samples are available, the attention module adaptively aggregates predictions of the student headers, which can further improve performance. We conduct extensive experiments on three popular computer vision datasets. In particular, compared with the most competitive alternative, the accuracy of the proposed framework is 1.18\% higher on the CIFAR-100 dataset, 1.67\% higher on the Caltech-101 dataset, and 2.99\% higher on the mini-ImageNet dataset.
    Learning to Assemble Geometric Shapes. (arXiv:2205.11809v1 [cs.CV])
    Assembling parts into an object is a combinatorial problem that arises in a variety of contexts in the real world and involves numerous applications in science and engineering. Previous related work tackles limited cases with identical unit parts or jigsaw-style parts of textured shapes, which greatly mitigate the combinatorial challenges of the problem. In this work, we introduce the more challenging problem of shape assembly, which involves textureless fragments of arbitrary shapes with indistinctive junctions, and then propose a learning-based approach to solving it. We demonstrate the effectiveness of our approach on shape assembly tasks with various scenarios, including those with abnormal fragments (e.g., missing and distorted), different numbers of fragments, and different rotation discretizations.
    Attributing AUC-ROC to Analyze Binary Classifier Performance. (arXiv:2205.11781v1 [cs.LG])
    Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a popular evaluation metric for binary classifiers. In this paper, we discuss techniques to segment the AUC-ROC along human-interpretable dimensions. AUC-ROC is not an additive/linear function over the data samples; therefore, segmenting the overall AUC-ROC is different from tabulating the AUC-ROC of data segments. To segment the overall AUC-ROC, we must first solve an \emph{attribution} problem to identify credit for individual examples. We observe that AUC-ROC, though non-linear over examples, is linear over \emph{pairs} of examples. This observation leads to a simple, efficient attribution technique for examples (example attributions), and for pairs of examples (pair attributions). We automatically slice these attributions using decision trees by making the tree predict the attributions; we use the notion of honest estimates along with a t-test to mitigate false discovery. Our experiments with the method show that an inferior model can outperform a superior model (trained to optimize a different training objective) on the inferior model's own training objective, a manifestation of Goodhart's Law. In contrast, AUC attributions enable a reasonable comparison. Example attributions can be used to slice this comparison. Pair attributions are used to categorize pairs of items -- one positively labeled and one negatively -- that the model has trouble separating. These categories identify the decision boundary of the classifier and the headroom to improve AUC.
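    To make the pair-attribution idea concrete: AUC equals the fraction of correctly ordered (positive, negative) pairs, so each pair carries credit $1/(P \cdot N)$ and per-example credit follows by summing over one side. A small sketch (the decision-tree slicing machinery is omitted):

        import numpy as np

        def auc_pair_attributions(scores, labels):
            # AUC = mean over all (pos, neg) pairs of 1[s_pos > s_neg],
            # with 1/2 credit for ties; each pair gets credit 1/(P*N).
            pos = scores[labels == 1]
            neg = scores[labels == 0]
            comp = (pos[:, None] > neg[None, :]).astype(float)
            comp += 0.5 * (pos[:, None] == neg[None, :])
            pair_attr = comp / (len(pos) * len(neg))
            return pair_attr.sum(), pair_attr.sum(axis=1), pair_attr.sum(axis=0)

        scores = np.array([0.9, 0.8, 0.3, 0.2, 0.6])
        labels = np.array([1, 1, 0, 0, 1])
        auc, pos_attr, neg_attr = auc_pair_attributions(scores, labels)
        print(auc)  # 1.0: every positive outscores every negative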
    Penalized Proximal Policy Optimization for Safe Reinforcement Learning. (arXiv:2205.11814v1 [cs.LG])
    Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle for efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
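    A minimal sketch of the kind of penalized objective the abstract describes, as we read it (a ReLU penalty on constraint violation added to a PPO-style clipped surrogate; the names and exact form are illustrative, not the authors' code):

        import torch

        def p3o_loss(ratio, adv_r, adv_c, cost_excess, kappa, eps=0.2):
            # ratio: pi_new/pi_old for sampled actions; adv_r/adv_c: reward
            # and cost advantages; cost_excess: estimated J_C - d (positive
            # when the constraint is violated); kappa: finite penalty factor.
            reward_term = torch.min(
                ratio * adv_r,
                torch.clamp(ratio, 1 - eps, 1 + eps) * adv_r).mean()
            # Exact ReLU penalty: active only when the constraint is violated.
            cost_term = torch.relu(cost_excess + (ratio * adv_c).mean())
            return -(reward_term - kappa * cost_term)

        ratio = torch.ones(4, requires_grad=True)
        loss = p3o_loss(ratio, torch.randn(4), torch.randn(4),
                        cost_excess=0.3, kappa=5.0)
        loss.backward()  # would propagate into the policy in a real setup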
    NFL: Robust Learned Index via Distribution Transformation. (arXiv:2205.11807v1 [cs.DB])
    Recent works on learned indexes open a new direction for the indexing field. The key insight of the learned index is to approximate the mapping between keys and positions with piece-wise linear functions. Such methods require partitioning the key space for a better approximation. Although many heuristics have been proposed to improve the approximation quality, the bottleneck is that the segmentation overhead can hinder the overall performance. This paper tackles the approximation problem by applying a \textit{distribution transformation} to the keys before constructing the learned index. A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys. For effective distribution transformation, we propose a Numerical Normalizing Flow (Numerical NF). Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI). To validate the performance, comprehensive evaluations are conducted on both synthetic and real-world workloads, showing that the proposed NFL produces the highest throughput and the lowest tail latency compared to the state-of-the-art learned indexes.
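    To illustrate why the distribution transformation helps, here is a toy stand-in where an empirical-CDF transform (NFL instead learns a numerical normalizing flow) maps skewed keys to a near-uniform distribution, after which a single linear model approximates key positions well:

        import numpy as np

        rng = np.random.default_rng(0)
        keys = np.sort(rng.lognormal(size=100_000))  # skewed key distribution
        positions = np.arange(len(keys))
        sample = keys[::100]  # sorted sample approximating the key CDF

        def transform(q):
            # empirical-CDF transform: output is near-uniform on [0, 1]
            return np.searchsorted(sample, q) / len(sample)

        # After the transform, one linear model fits position(key) well.
        slope, intercept = np.polyfit(transform(keys), positions, deg=1)
        pred = slope * transform(keys[1234]) + intercept
        print(int(pred), 1234)  # predicted vs. true position (small error)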
    Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model. (arXiv:2205.11833v1 [cs.LG])
    Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has not been very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrate that winning-ticket subnetworks produce more diverse predictions than dense networks, and their ensemble outperforms the standard ensemble on some tasks.
    Alleviating Robust Overfitting of Adversarial Training With Consistency Regularization. (arXiv:2205.11744v1 [cs.LG])
    Adversarial training (AT) has proven to be one of the most effective ways to defend Deep Neural Networks (DNNs) against adversarial attacks. However, the phenomenon of robust overfitting, i.e., that the robustness will drop sharply at a certain stage, always exists during AT. It is of great importance to decrease this robust generalization gap in order to obtain a robust model. In this paper, we present an in-depth study of robust overfitting from a new angle. We observe that consistency regularization, a popular technique in semi-supervised learning, has a similar goal as AT and can be used to alleviate robust overfitting. We empirically validate this observation, and find that a majority of prior solutions have implicit connections to consistency regularization. Motivated by this, we introduce a new AT solution, which integrates the consistency regularization and Mean Teacher (MT) strategy into AT. Specifically, we introduce a teacher model, obtained from the average weights of the student models over the training steps. Then we design a consistency loss function to make the prediction distribution of the student models over adversarial examples consistent with that of the teacher model over clean samples. Experiments show that our proposed method can effectively alleviate robust overfitting and improve the robustness of DNN models against common adversarial attacks.
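    A minimal sketch of the consistency term and EMA teacher described above, assuming standard Mean Teacher conventions (the adversarial-example generation step is omitted):

        import torch
        import torch.nn.functional as F

        def consistency_loss(student, teacher, x_clean, x_adv, tau=1.0):
            # Align the student's distribution on adversarial examples with
            # the teacher's distribution on the corresponding clean samples.
            with torch.no_grad():
                p_teacher = F.softmax(teacher(x_clean) / tau, dim=1)
            log_p_student = F.log_softmax(student(x_adv) / tau, dim=1)
            return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

        @torch.no_grad()
        def ema_update(teacher, student, decay=0.999):
            # Mean Teacher: teacher weights are an EMA of student weights.
            for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                t_p.mul_(decay).add_(s_p, alpha=1 - decay)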
    Wireless Ad Hoc Federated Learning: A Fully Distributed Cooperative Machine Learning. (arXiv:2205.11779v1 [cs.LG])
    Federated learning has allowed the training of a global model by aggregating local models trained on local nodes. However, it still assumes a client-server model, which could be further distributed: fully decentralized, partially connected, or even totally opportunistic. In this paper, we propose wireless ad hoc federated learning (WAFL) -- fully distributed cooperative machine learning organized by the nodes physically nearby. Here, each node has a wireless interface and can communicate with the others when they are within radio range. The nodes are expected to move with people, vehicles, or robots, producing opportunistic contacts with each other. In WAFL, each node trains a model individually with the local data it has. When a node encounters others, they exchange their trained models and generate new aggregated models, which are expected to be more general compared to the locally trained models on Non-IID data. For evaluation, we have prepared four static communication networks and two types of dynamic and opportunistic communication networks based on random waypoint mobility and a community-structured environment, and then studied the training process of a fully connected neural network with a 90% Non-IID MNIST dataset. The evaluation results indicate that WAFL allowed the model parameters to converge among the nodes toward generalization, even with opportunistic node contact scenarios -- whereas in the self-training (or lonely training) case, they diverged. This model generalization allowed WAFL to achieve higher accuracy (94.7-96.2%) on the IID test dataset compared to the self-training case (84.7%).
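    The core opportunistic step admits a very small sketch: on contact, two nodes average their parameters and both adopt the result (the paper's aggregation may be weighted differently; this only shows the flow):

        import numpy as np

        def wafl_encounter(model_a, model_b):
            # On radio contact, average the parameters and have both nodes
            # adopt the aggregate, which generalizes beyond either local model.
            merged = {k: 0.5 * (model_a[k] + model_b[k]) for k in model_a}
            return dict(merged), merged

        a = {"w": np.array([1.0, 2.0])}
        b = {"w": np.array([3.0, 4.0])}
        new_a, new_b = wafl_encounter(a, b)
        print(new_a["w"])  # [2. 3.]: both nodes now share the aggregate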
    Constrained Monotonic Neural Networks. (arXiv:2205.11775v1 [cs.LG])
    Deep neural networks are becoming increasingly popular in approximating arbitrary functions from noisy data. But wider adoption is being hindered by the need to explain such models and to impose additional constraints on them. The monotonicity constraint is one of the most requested properties in real-world scenarios and is the focus of this paper. One of the oldest ways to construct a monotonic fully connected neural network is to constrain its weights to be non-negative while employing a monotonic activation function. Unfortunately, this construction does not work with popular non-saturated activation functions such as ReLU, ELU, and SELU, as it can only approximate convex functions. We show this shortcoming can be fixed by employing the original activation function for a part of the neurons in the layer, and employing its point reflection for the other part. Our experiments show this approach to building monotonic deep neural networks has matching or better accuracy when compared to other state-of-the-art methods such as deep lattice networks or monotonic networks obtained by heuristic regularization. This method is the simplest one in the sense of having the fewest parameters, and it requires no modifications to the learning procedure and no post-learning steps.
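    A minimal sketch of the described construction: non-negative weights (here via squaring, our assumption; any non-negative reparameterization would do), with the activation applied as-is to half the units and as its point reflection $-\rho(-x)$ to the other half, covering both convex and concave pieces:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MonotonicLayer(nn.Module):
            def __init__(self, d_in, d_out, act=F.relu):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
                self.bias = nn.Parameter(torch.zeros(d_out))
                self.act = act

            def forward(self, x):
                # squared weights are non-negative, preserving monotonicity
                z = F.linear(x, self.weight ** 2, self.bias)
                half = z.shape[-1] // 2
                # original activation on one half, point reflection on the other
                return torch.cat([self.act(z[..., :half]),
                                  -self.act(-z[..., half:])], dim=-1)

        layer = MonotonicLayer(4, 8)
        x = torch.rand(10, 4)
        assert (layer(x + 0.5) >= layer(x)).all()  # element-wise monotone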
    ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. (arXiv:2205.11728v1 [cs.IR])
    Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).
    On the Role of Bidirectionality in Language Model Pre-Training. (arXiv:2205.11726v1 [cs.CL])
    Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
    BabyBear: Cheap inference triage for expensive language models. (arXiv:2205.11747v1 [cs.CL])
    Transformer language models provide superior accuracy over previous models but they are computationally and environmentally expensive. Borrowing the concept of model cascading from computer vision, we introduce BabyBear, a framework for cascading models for natural language processing (NLP) tasks to minimize cost. The core strategy is inference triage, exiting early when the least expensive model in the cascade achieves a sufficiently high-confidence prediction. We test BabyBear on several open source data sets related to document classification and entity recognition. We find that for common NLP tasks a high proportion of the inference load can be accomplished with cheap, fast models that have learned by observing a deep learning model. This allows us to reduce the compute cost of large-scale classification jobs by more than 50% while retaining overall accuracy. For named entity recognition, we save 33% of the deep learning compute while maintaining an F1 score higher than 95% on the CoNLL benchmark.
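    The inference-triage loop itself is tiny; a sketch with hypothetical model stand-ins that return (label, confidence) pairs:

        def triage(texts, cheap_model, expensive_model, threshold=0.9):
            # Accept the cheap model's answer when its confidence clears the
            # threshold; otherwise fall back to the expensive model.
            results = []
            for text in texts:
                label, conf = cheap_model(text)
                if conf >= threshold:
                    results.append(label)  # early exit: no expensive call
                else:
                    results.append(expensive_model(text)[0])
            return results

        cheap = lambda t: ("pos", 0.95) if "great" in t else ("neg", 0.6)
        big = lambda t: ("pos", 1.0)  # pretend this is the large model
        print(triage(["great movie", "hmm"], cheap, big))  # ['pos', 'pos']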
    HiPAL: A Deep Framework for Physician Burnout Prediction Using Activity Logs in Electronic Health Records. (arXiv:2205.11680v1 [cs.LG])
    Burnout is a significant public health concern affecting nearly half of the healthcare workforce. This paper presents the first end-to-end deep learning framework for predicting physician burnout based on clinician activity logs, digital traces of their work activities, available in any electronic health record (EHR) system. In contrast to prior approaches that exclusively relied on surveys for burnout measurement, our framework directly learns deep workload representations from large-scale clinician activity logs to predict burnout. We propose Hierarchical burnout Prediction based on Activity Logs (HiPAL), featuring a pre-trained time-dependent activity embedding mechanism tailored for activity logs and a hierarchical predictive model, which mirrors the natural hierarchical structure of clinician activity logs and captures physicians' evolving workload patterns at both short-term and long-term levels. To utilize the large amount of unlabeled activity logs, we propose a semi-supervised framework that learns to transfer knowledge extracted from unlabeled clinician activities to the HiPAL-based prediction model. Experiments on over 15 million clinician activity logs collected from the EHR at a large academic medical center demonstrate the advantages of our proposed framework over state-of-the-art approaches in both predictive performance and training efficiency.
    Demand Response Method Considering Multiple Types of Flexible Loads in Industrial Parks. (arXiv:2205.11743v1 [eess.SY])
    With the rapid development of the energy internet, the proportion of flexible loads in the smart grid is much higher than before. It is highly important to model flexible loads based on demand response. Therefore, a new demand response method considering multiple types of flexible loads is proposed in this paper to characterize integrated demand response (IDR) resources. Firstly, a physical process analytical deduction (PPAD) model is proposed to improve the classification of flexible loads in industrial parks. Scenario generation, data point augmentation, and smooth curves under various operating conditions are considered to enhance the applicability of the model. Secondly, in view of the strong volatility and poor modeling effect of Wasserstein generative adversarial networks (WGAN), an improved WGAN with gradient penalty (IWGAN-GP) model is developed, which converges faster than the traditional WGAN and generates higher-quality samples. Finally, the PPAD and IWGAN-GP models are jointly implemented to reveal the degree of correlation between flexible loads. Meanwhile, an intelligent offline database is built to deal with the impact of nonlinear factors in different response scenarios. Numerical examples show that the proposed method is significantly better than existing technologies in reducing load modeling deviation and improving the responsiveness of park loads.
    Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy. (arXiv:2106.01100v3 [eess.IV] UPDATED)
    During lung radiotherapy, the position of infrared reflective objects on the chest can be recorded to estimate the tumor location. However, radiotherapy systems have a latency inherent to robot control limitations that impedes the radiation delivery precision. Prediction with online learning of recurrent neural networks (RNN) allows for adaptation to non-stationary respiratory signals, but classical methods such as RTRL and truncated BPTT are respectively slow and biased. This study investigates the capabilities of unbiased online recurrent optimization (UORO) to forecast respiratory motion and enhance safety in lung radiotherapy. We used 9 observation records of the 3D position of 3 external markers on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The sampling frequency was 10Hz, and the amplitudes of the recorded trajectories range from 6mm to 40mm in the superior-inferior direction. We forecast the 3D location of each marker simultaneously with a horizon value between 0.1s and 2.0s, using an RNN trained with UORO. We compare its performance with an RNN trained with RTRL, LMS, and offline linear regression. We provide closed-form expressions for quantities involved in the gradient loss calculation in UORO, thereby making its implementation efficient. Training and cross-validation were performed during the first minute of each sequence. On average over the horizon values considered and the 9 sequences, UORO achieves the lowest root-mean-square (RMS) error and maximum error among the compared algorithms. These errors are respectively equal to 1.3mm and 8.8mm, and the prediction time per time step was lower than 2.8ms (Dell Intel core i9-9900K 3.60 GHz). Linear regression has the lowest RMS error for the horizon values 0.1s and 0.2s, followed by LMS for horizon values between 0.3s and 0.5s, and UORO for horizon values greater than 0.6s.
    Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable. (arXiv:2205.11716v1 [cs.LG])
    Recently, neural networks have been shown to perform exceptionally well in transforming two arbitrary sets into two linearly separable sets. Doing this with a randomly initialized neural network is of immense interest because the associated computation is cheaper than using fully trained networks. In this paper, we show that, with sufficient width, a randomly initialized one-layer neural network transforms two sets into two linearly separable sets with high probability. Furthermore, we provide explicit bounds on the required width of the neural network for this to occur. Our first bound is exponential in the input dimension and polynomial in all other parameters, while our second bound is independent of the input dimension, thereby overcoming the curse of dimensionality. We also perform an experimental study comparing the separation capacity of randomly initialized one-layer and two-layer neural networks. With correctly chosen biases, our study shows that for low-dimensional data, the two-layer neural network outperforms the one-layer network. However, the opposite is observed for higher-dimensional data.
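    The phenomenon is easy to reproduce empirically; a sketch using random ReLU features on data that is not linearly separable in its original 2-D space (the width here is an arbitrary choice, not one of the paper's bounds):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)

        # Two classes on concentric circles: not linearly separable in 2-D.
        theta = rng.uniform(0, 2 * np.pi, 400)
        r = np.repeat([1.0, 3.0], 200)
        X = np.c_[r * np.cos(theta), r * np.sin(theta)]
        y = np.repeat([0, 1], 200)

        # Random, untrained one-layer ReLU network with large width.
        width = 2000
        W, b = rng.normal(size=(2, width)), rng.normal(size=width)
        features = np.maximum(X @ W + b, 0)

        # A linear classifier on the random features separates the circles.
        clf = LogisticRegression(max_iter=5000).fit(features, y)
        print(clf.score(features, y))  # typically 1.0: linearly separable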
    Embedding Neighborhoods Simultaneously t-SNE (ENS-t-SNE). (arXiv:2205.11720v1 [cs.LG])
    We propose an algorithm for visualizing a dataset by embedding it in 3-dimensional Euclidean space based on various given distances between the same pairs of datapoints. Its aim is to find an Embedding which preserves Neighborhoods Simultaneously for all given distances by generalizing the t-Stochastic Neighborhood Embedding approach (ENS-t-SNE). We illustrate the utility of ENS-t-SNE by demonstrating its use in three applications. First, to visualize different notions of clusters and groups within the same high-dimensional dataset with one 3-dimensional embedding, as opposed to providing different embeddings of the same data and trying to match the corresponding points. Second, to illustrate the effects of different hyper-parameters of the classical t-SNE. Third, by considering multiple different notions of clustering in data, ENS-t-SNE can generate an alternative embedding to that of the classic t-SNE. We provide an extensive quantitative evaluation with real-world and synthetic datasets of different sizes, using different numbers of projections.
    High-Order Pooling for Graph Neural Networks with Tensor Decomposition. (arXiv:2205.11691v1 [cs.LG])
    Graph Neural Networks (GNNs) are attracting growing attention due to their effectiveness and flexibility in modeling a variety of graph-structured data. Existing GNN architectures usually adopt simple pooling operations (e.g., sum, average, max) when aggregating messages from a local neighborhood for updating node representations or pooling node representations from the entire graph to compute the graph representation. Though simple and effective, these linear operations do not model high-order non-linear interactions among nodes. We propose the Tensorized Graph Neural Network (tGNN), a highly expressive GNN architecture relying on tensor decomposition to model high-order non-linear node interactions. tGNN leverages the symmetric CP decomposition to efficiently parameterize permutation-invariant multilinear maps for modeling node interactions. Theoretical and empirical analysis on both node and graph classification tasks show the superiority of tGNN over competitive baselines. In particular, tGNN achieves state-of-the-art results on two OGB node classification datasets and one OGB graph classification dataset.
    Semi-Parametric Deep Neural Networks in Linear Time and Memory. (arXiv:2205.11718v1 [cs.LG])
    Recent advances in deep learning have been driven by large-scale parametric models, which can be computationally expensive and lack interpretability. Semi-parametric methods query the training set at inference time and can be more compact, although they typically have quadratic computational complexity. Here, we introduce SPIN, a general-purpose semi-parametric neural architecture whose computational cost is linear in the size and dimensionality of the data. Our architecture is inspired by inducing point methods and relies on a novel application of cross-attention between datapoints. At inference time, its computational cost is constant in the training set size as the data gets distilled into a fixed number of inducing points. We find that our method reduces the computational requirements of existing semi-parametric models by up to an order of magnitude across a range of datasets and improves state-of-the-art performance on an important practical problem, genotype imputation.
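    A minimal sketch of the inducing-point cross-attention idea (our reading of the abstract, not the authors' architecture): the training set is distilled once into m learned inducing points, so query-time cost no longer depends on the training-set size:

        import torch
        import torch.nn as nn

        class InducingCrossAttention(nn.Module):
            def __init__(self, d, m=16, heads=4):
                super().__init__()
                self.inducing = nn.Parameter(torch.randn(m, d))
                self.distill = nn.MultiheadAttention(d, heads, batch_first=True)
                self.query = nn.MultiheadAttention(d, heads, batch_first=True)

            def forward(self, train_embed, query_embed):
                # O(n*m): the training set is distilled into m inducing points
                ind = self.inducing.unsqueeze(0).expand(train_embed.size(0), -1, -1)
                summary, _ = self.distill(ind, train_embed, train_embed)
                # O(q*m): query cost is independent of training-set size
                out, _ = self.query(query_embed, summary, summary)
                return out

        model = InducingCrossAttention(d=64)
        train = torch.randn(1, 10_000, 64)  # large training set
        queries = torch.randn(1, 8, 64)
        print(model(train, queries).shape)  # torch.Size([1, 8, 64])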
    Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. (arXiv:2205.12184v1 [cs.LG])
    Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for It\^o diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
    Policy Compliance Detection via Expression Tree Inference. (arXiv:2205.12259v1 [cs.CL])
    Policy Compliance Detection (PCD) is a task we encounter when reasoning over texts, e.g. legal frameworks. Previous work to address PCD relies heavily on modeling the task as a special case of Recognizing Textual Entailment. Entailment is applicable to the problem of PCD; however, viewing the policy as a single proposition, as opposed to multiple interlinked propositions, yields poor performance and lacks explainability. To address this challenge, more recent proposals for PCD have argued for decomposing policies into expression trees consisting of questions connected with logic operators. Question answering is used to obtain answers to these questions with respect to a scenario. Finally, the expression tree is evaluated in order to arrive at an overall solution. However, this work assumes expression trees are provided by experts, thus limiting its applicability to new policies. In this work, we learn how to infer expression trees automatically from policy texts. We ensure the validity of the inferred trees by introducing constrained decoding using a finite state automaton. We determine through automatic evaluation that 63% of the expression trees generated by our constrained generation model are logically equivalent to gold trees. Human evaluation shows that 88% of trees generated by our model are correct.
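    For concreteness, an expression tree of questions joined by logic operators evaluates mechanically once a QA model answers the leaves; a toy sketch with a hypothetical policy and a lookup standing in for the QA model:

        # Hypothetical policy: eligible if resident AND (employed OR student).
        tree = ("AND",
                "Is the applicant a resident?",
                ("OR", "Is the applicant employed?",
                       "Is the applicant a student?"))

        def evaluate(node, answer):
            # `answer(question) -> bool` stands in for the QA model.
            if isinstance(node, str):
                return answer(node)
            op, *children = node
            values = [evaluate(c, answer) for c in children]
            return all(values) if op == "AND" else any(values)

        answers = {"Is the applicant a resident?": True,
                   "Is the applicant employed?": False,
                   "Is the applicant a student?": True}
        print(evaluate(tree, answers.get))  # True: the policy is satisfied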
    Psychotic Relapse Prediction in Schizophrenia Patients using A Mobile Sensing-based Supervised Deep Learning Model. (arXiv:2205.12225v1 [cs.LG])
    Mobile sensing-based modeling of behavioral changes could predict an oncoming psychotic relapse in schizophrenia patients for timely interventions. Deep learning models could complement existing non-deep learning models for relapse prediction by modeling latent behavioral features relevant to the prediction. However, given the inter-individual behavioral differences, model personalization might be required for a predictive model. In this work, we propose RelapsePredNet, a Long Short-Term Memory (LSTM) neural network-based model for relapse prediction. The model is personalized for a particular patient by training on data from the patients most similar to that patient. Several demographics and baseline mental health scores were considered as personalization metrics to define patient similarity. We investigated the effect of personalization on training dataset characteristics, learned embeddings, and relapse prediction performance. We compared RelapsePredNet with a deep learning-based anomaly detection model for relapse prediction. Further, we investigated whether RelapsePredNet could complement ClusterRFModel (a random forest model leveraging clustering and template features proposed in prior work) in a fusion model, by identifying latent behavioral features relevant for relapse prediction. The CrossCheck dataset, consisting of continuous mobile sensing data obtained from 63 schizophrenia patients, each monitored for up to a year, was used for our evaluations. The proposed RelapsePredNet outperformed the deep learning-based anomaly detection model for relapse prediction. The F2 scores for prediction were 0.21 and 0.52 on the full test set and the Relapse Test Set (consisting only of data from patients who relapsed), respectively. These corresponded to 29.4% and 38.8% improvements compared to the existing deep learning-based model for relapse prediction.
    Forecasting Multilinear Data via Transform-Based Tensor Autoregression. (arXiv:2205.12201v1 [cs.LG])
    In the era of big data, there is an increasing demand for new methods for analyzing and forecasting 2-dimensional data. The current research aims to accomplish these goals through the combination of time-series modeling and multilinear algebraic systems. We expand previous autoregressive techniques to forecast multilinear data, aptly named the L-Transform Tensor autoregressive (L-TAR for short). Tensor decompositions and multilinear tensor products have allowed for this approach to be a feasible method of forecasting. We achieve statistical independence between the columns of the observations through invertible discrete linear transforms, enabling a divide and conquer approach. We present an experimental validation of the proposed methods on datasets containing image collections, video sequences, sea surface temperature measurements, stock prices, and networks.
    Learning multi-scale functional representations of proteins from single-cell microscopy data. (arXiv:2205.11676v1 [q-bio.QM])
    Protein function is inherently linked to its localization within the cell, and fluorescent microscopy data is an indispensable resource for learning representations of proteins. Despite major developments in molecular representation learning, extracting functional information from biological images remains a non-trivial computational task. Current state-of-the-art approaches use autoencoder models to learn high-quality features by reconstructing images. However, such methods are prone to capturing noise and imaging artifacts. In this work, we revisit deep learning models used for classifying major subcellular localizations, and evaluate representations extracted from their final layers. We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information, and significantly outperform autoencoder-based models. We also propose a robust evaluation strategy to assess quality of protein representations across different scales of biological function.
    Compressing Deep Graph Neural Networks via Adversarial Knowledge Distillation. (arXiv:2205.11678v1 [cs.LG])
    Deep graph neural networks (GNNs) have been shown to be expressive for modeling graph-structured data. Nevertheless, the over-stacked architecture of deep graph models makes it difficult to deploy and rapidly test them on mobile or embedded systems. To compress over-stacked GNNs, knowledge distillation via a teacher-student architecture turns out to be an effective technique, where the key step is to measure the discrepancy between teacher and student networks with predefined distance functions. However, using the same distance for graphs of various structures may be unsuitable, and the optimal distance formulation is hard to determine. To tackle these problems, we propose a novel Adversarial Knowledge Distillation framework for graph models named GraphAKD, which adversarially trains a discriminator and a generator to adaptively detect and decrease the discrepancy. Specifically, noticing that the well-captured inter-node and inter-class correlations favor the success of deep GNNs, we propose to criticize the inherited knowledge from node-level and class-level views with a trainable discriminator. The discriminator distinguishes between teacher knowledge and what the student inherits, while the student GNN works as a generator and aims to fool the discriminator. To the best of our knowledge, GraphAKD is the first to introduce adversarial training to knowledge distillation in graph domains. Experiments on node-level and graph-level classification benchmarks demonstrate that GraphAKD improves the student performance by a large margin. The results imply that GraphAKD can precisely transfer knowledge from a complicated teacher GNN to a compact student GNN.
    FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid? (arXiv:2205.11656v1 [cs.LG])
    The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. Training such models and exploring their hyperparameter space, however, is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, analysis has been limited to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. A FlexiBERT model with performance equivalent to that of the best homogeneous model is 2.6x smaller. FlexiBERT-Large, another proposed model, achieves state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.
    Improving Fairness for Data Valuation in Horizontal Federated Learning. (arXiv:2109.09046v3 [cs.LG] UPDATED)
    Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners' participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
    Intelligence Primer. (arXiv:2008.07324v3 [cs.AI] UPDATED)
    Intelligence is a fundamental part of all living things, as well as the foundation for Artificial Intelligence. In this primer we explore the ideas associated with intelligence and, by doing so, understand the implications and constraints and potentially outline the capabilities of future systems. Artificial Intelligence, in the form of Machine Learning, has already had a significant impact on our lives. As an exploration, we journey into different parts of intelligence that appear essential. We hope that people find this helpful in determining the future. Also, during the exploration, we hope to create new thought-provoking questions. Intelligence is not a single weighable quantity but a subject that spans Biology, Physics, Philosophy, Cognitive Science, Neuroscience, Psychology, and Computer Science. The historian Yuval Noah Harari pointed out that engineers and scientists in the future will have to broaden their understanding to include disciplines such as Psychology, Philosophy, and Ethics. Fiction writers have long portrayed engineers and scientists as deficient in these areas. Today, in modern society, the emergence of Artificial Intelligence and legal requirements act as forcing functions to push these broader subjects into the foreground. We start with an introduction to intelligence and move quickly to more profound thoughts and ideas. We call this a Life, the Universe, and Everything primer, after the famous science fiction book by Douglas Adams. Forty-two may be the correct answer, but what are the questions?
    Throwing Away Data Improves Worst-Class Error in Imbalanced Classification. (arXiv:2205.11672v1 [stat.ML])
    Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that \emph{more data is better}, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority group in the training data, and finds additional motivation in the robustness, fairness, and out-of-distribution literatures. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match in size the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, \emph{throwing away most data from the majority class leads to better worst-class error}.
    Eigenvalue and Generalized Eigenvalue Problems: Tutorial. (arXiv:1903.11240v2 [stat.ML] UPDATED)
    This paper is a tutorial on eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we mention the optimization problems which give rise to eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.
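    As a worked example of the generalized problem $A v = \lambda B v$ covered by the tutorial (scipy solves it directly when $B$ is symmetric positive-definite, the setting behind e.g. Fisher discriminant analysis, where $A$ and $B$ are scatter matrices):

        import numpy as np
        from scipy.linalg import eigh

        # Generalized eigenvalue problem A v = lambda B v with symmetric A
        # and symmetric positive-definite B.
        A = np.array([[2.0, 1.0], [1.0, 3.0]])
        B = np.array([[1.0, 0.2], [0.2, 1.0]])

        eigvals, eigvecs = eigh(A, B)  # scipy solves the generalized form
        v, lam = eigvecs[:, -1], eigvals[-1]  # top eigenpair
        print(np.allclose(A @ v, lam * (B @ v)))  # True: A v = lambda B v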
    Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment. (arXiv:2205.11616v1 [cs.CL])
    Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models to enable more efficient and robust UWT. Specifically, we develop a novel UWT method dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for the word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, which iteratively corrects and refines the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for a few language pairs across different word embeddings and displays great robustness to the dissimilarity of language pairs or training corpora for the two word embeddings.
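    The core of the second step is classical orthogonal Procrustes: the rotation minimizing $\|XW - Y\|_F$ comes from the SVD of $X^T Y$. A sketch of that core (WALIP's robust reweighting and iterative refinement are omitted):

        import numpy as np

        def procrustes(X, Y):
            # Orthogonal W minimizing ||X W - Y||_F, via the SVD of X^T Y.
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 64))            # "source" word embeddings
        W_true, _ = np.linalg.qr(rng.normal(size=(64, 64)))
        Y = X @ W_true                            # rotated "target" space
        print(np.allclose(procrustes(X, Y), W_true))  # recovers the mapping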
    Deep Representations for Time-varying Brain Datasets. (arXiv:2205.11648v1 [cs.LG])
    Finding an appropriate representation of dynamic activities in the brain is crucial for many downstream applications. Due to its highly dynamic nature, temporally averaged fMRI (functional magnetic resonance imaging) can only provide a narrow view of underlying brain activities. Previous works lack the ability to learn and interpret the latent dynamics in brain architectures. This paper builds an efficient graph neural network model that incorporates both region-mapped fMRI sequences and structural connectivities obtained from DWI (diffusion-weighted imaging) as inputs. We find good representations of the latent brain dynamics through learning sample-level adaptive adjacency matrices and performing a novel multi-resolution inner cluster smoothing. These modules can be easily adapted to and are potentially useful for other applications outside the neuroscience domain. We also attribute inputs with integrated gradients, which enables us to infer (1) highly involved brain connections and subnetworks for each task, (2) temporal keyframes of imaging sequences that characterize tasks, and (3) subnetworks that discriminate between individual subjects. This ability to identify critical subnetworks that characterize signal states across heterogeneous tasks and individuals is of great importance to neuroscience and other scientific domains. Extensive experiments and ablation studies demonstrate our proposed method's superiority and efficiency in spatial-temporal graph signal modeling with insightful interpretations of brain dynamics.
    A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature. (arXiv:2205.11651v1 [cs.DL])
    Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
    DOGE-Train: Discrete Optimization on GPU with End-to-end Training. (arXiv:2205.11638v1 [cs.LG])
    We present a fast, scalable, data-driven approach for solving linear relaxations of 0-1 integer linear programs using a graph neural network. Our solver is based on the Lagrange decomposition based algorithm FastDOG (Abbas et al. (2022)). We make the algorithm differentiable and perform backpropagation through the dual update scheme for end-to-end training of its algorithmic parameters. This allows us to preserve the algorithm's theoretical properties, including feasibility and guaranteed non-decrease in the lower bound. Since FastDOG can get stuck in suboptimal fixed points, we provide additional freedom to our graph neural network to predict non-parametric update steps for escaping such points while maintaining dual feasibility. For training the graph neural network we use an unsupervised loss and perform experiments on large-scale real-world datasets. We train on smaller problems and test on larger ones, showing strong generalization performance with a graph neural network comprising only around 10k parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version. In comparison to commercial solvers, our learned solver achieves close to optimal objective values of LP relaxations and is faster by up to an order of magnitude on very large problems from structured prediction and on selected combinatorial optimization problems.
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v1 [stat.ML])
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models - the Variational Auto-Encoder (VAE). We point out the two generalization gaps that can affect the generalization ability of VAEs and show that the over-fitting phenomenon is usually dominated by the amortized inference network. Based on this observation we propose a new training objective, inspired by the classic wake-sleep algorithm, to improve the generalization properties of amortized inference. We also demonstrate how it can improve generalization performance in the context of image modeling and lossless compression.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v1 [cs.LG])
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
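    A minimal sketch of the FrozenHopfield association as described above: a random, fixed projection maps observations into the token-embedding space, and a softmax retrieval over the frozen embeddings returns a mixture of original tokens, with no training involved (the dimensions and beta here are arbitrary):

        import torch
        import torch.nn.functional as F

        def frozen_hopfield(obs, token_emb, proj, beta=8.0):
            # Fixed random projection into embedding space, then softmax
            # retrieval over the frozen token embeddings.
            query = obs @ proj
            scores = query @ token_emb.T
            return F.softmax(beta * scores, dim=-1) @ token_emb

        vocab, d_emb, d_obs = 100, 32, 12
        token_emb = torch.randn(vocab, d_emb)            # frozen PLT embeddings
        proj = torch.randn(d_obs, d_emb) / d_obs ** 0.5  # random, never trained
        obs = torch.randn(5, d_obs)
        print(frozen_hopfield(obs, token_emb, proj).shape)  # (5, 32)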
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v1 [cs.LG])
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms, and (2) we introduce a multi-task-learning-based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data and non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.
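    As a conceptual baseline for the task uGLAD solves, the convex graphical lasso recovers a sparse precision (adjacency) matrix from Gaussian samples; uGLAD replaces this solver with a deep unrolled network and tunes the sparsity penalty automatically, whereas cross-validation picks it here:

        import numpy as np
        from sklearn.covariance import GraphicalLassoCV

        rng = np.random.default_rng(0)
        theta = np.array([[2.0, 0.6, 0.0],   # ground-truth precision matrix:
                          [0.6, 2.0, 0.0],   # variables 0 and 1 connected,
                          [0.0, 0.0, 2.0]])  # variable 2 isolated
        X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(theta), size=2000)

        model = GraphicalLassoCV().fit(X)
        print(np.round(model.precision_, 2))  # near-zero entries off the graph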
    PrivFairFL: Privacy-Preserving Group Fairness in Federated Learning. (arXiv:2205.11584v1 [cs.LG])
    Group fairness ensures that the outcome of machine learning (ML) based decision making systems are not biased towards a certain group of people defined by a sensitive attribute such as gender or ethnicity. Achieving group fairness in Federated Learning (FL) is challenging because mitigating bias inherently requires using the sensitive attribute values of all clients, while FL is aimed precisely at protecting privacy by not giving access to the clients' data. As we show in this paper, this conflict between fairness and privacy in FL can be resolved by combining FL with Secure Multiparty Computation (MPC) and Differential Privacy (DP). In doing so, we propose a method for training group-fair ML models in cross-device FL under complete and formal privacy guarantees, without requiring the clients to disclose their sensitive attribute values.
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v1 [stat.ML])
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies to the class of exponential-family variational posterior distributions; we discuss the Gaussian case extensively, for which the updates take a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details about its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
    Cardiomegaly Detection using Deep Convolutional Neural Network with U-Net. (arXiv:2205.11515v1 [eess.IV])
    Cardiomegaly is a medical condition in which the heart is enlarged. It is easier to manage when caught early, so early detection is critical. The chest X-ray, being one of the most frequently used radiography examinations, has been used to detect and visualize abnormalities of human organs for decades. The X-ray is also a significant medical diagnosis tool for cardiomegaly. Even for domain experts, distinguishing the many types of diseases from an X-ray is a difficult and time-consuming task. Deep learning models are most effective when used on huge datasets, yet due to privacy concerns, large datasets are rarely available in the medical industry. A deep learning-based customized retrained U-Net model for detecting cardiomegaly is presented in this research. In the training phase, chest X-ray images from the "ChestX-ray8" open source real dataset are used. To reduce computing time, this model performs data preprocessing, image enhancement, image compression, and classification before moving on to the training step. The model was evaluated on a chest X-ray image dataset and produced a diagnostic accuracy of 94%, a sensitivity of 96.2%, and a specificity of 92.5%, which beats prior pre-trained model results for identifying cardiomegaly.
    BolT: Fused Window Transformers for fMRI Time Series Analysis. (arXiv:2205.11578v1 [eess.SP])
    Functional magnetic resonance imaging (fMRI) enables examination of inter-regional interactions in the brain via functional connectivity (FC) analyses that measure the synchrony between the temporal activations of separate regions. Given their exceptional sensitivity, deep-learning methods have received growing interest for FC analyses of high-dimensional fMRI data. In this domain, models that operate directly on raw time series as opposed to pre-computed FC features have the potential benefit of leveraging the full scale of information present in fMRI data. However, previous models are based on architectures suboptimal for temporal integration of representations across multiple time scales. Here, we present BolT, blood-oxygen-level-dependent transformer, for analyzing multi-variate fMRI time series. BolT leverages a cascade of transformer encoders equipped with a novel fused window attention mechanism. Transformer encoding is performed on temporally-overlapped time windows within the fMRI time series to capture short time-scale representations. To integrate information across windows, cross-window attention is computed between base tokens in each time window and fringe tokens from neighboring time windows. To transition from local to global representations, the extent of window overlap and thereby number of fringe tokens is progressively increased across the cascade. Finally, a novel cross-window regularization is enforced to align the high-level representations of global $CLS$ features across time windows. Comprehensive experiments on public fMRI datasets clearly illustrate the superior performance of BolT against state-of-the-art methods. Posthoc explanatory analyses to identify landmark time points and regions that contribute most significantly to model decisions corroborate prominent neuroscientific findings from recent fMRI studies.
    Privacy-preserving Data Filtering in Federated Learning Using Influence Approximation. (arXiv:2205.11518v1 [cs.CR])
    Federated Learning by nature is susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied as the data is never revealed. We present a novel technique for filtering and scoring data based on a practical influence approximation that can be implemented in a privacy-preserving manner. Each agent uses its own data to evaluate the influence of another agent's batch, and reports to the center an obfuscated score using differential privacy. Our technique allows for almost perfect ($>92\%$ recall) filtering of corrupted data in a variety of applications using real data. Importantly, the accuracy does not degrade significantly, even under very strong privacy guarantees ($\varepsilon \leq 1$), especially under realistic percentages of mislabeled data (for $15\%$ mislabeled data we only lose $10\%$ in accuracy).
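    The reporting step lends itself to a compact sketch. Assuming scores are clipped to a bounded range (our assumption, to fix the sensitivity), the standard Laplace mechanism yields the kind of obfuscated score the abstract describes; the influence approximation itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def obfuscated_score(raw_score, epsilon=1.0, rng=None):
    # Clip to [0, 1] so the sensitivity is 1, then add Laplace noise with
    # scale 1/epsilon: a standard epsilon-differentially-private release.
    rng = rng or np.random.default_rng()
    clipped = float(np.clip(raw_score, 0.0, 1.0))
    return clipped + rng.laplace(0.0, 1.0 / epsilon)
```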
    Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums. (arXiv:2205.12173v1 [cs.LG])
    Byzantine resilience has emerged as a prominent topic within the distributed machine learning community. Essentially, the goal is to enhance distributed optimization algorithms, such as distributed SGD, in a way that guarantees convergence despite the presence of some misbehaving (a.k.a., {\em Byzantine}) workers. Although a myriad of techniques addressing the problem have been proposed, the field arguably rests on fragile foundations. These techniques are hard to prove correct and rely on assumptions that are (a) quite unrealistic, i.e., often violated in practice, and (b) heterogeneous, i.e., making it difficult to compare approaches. We present \emph{RESAM (RESilient Averaging of Momentums)}, a unified framework that makes it simple to establish optimal Byzantine resilience, relying only on standard machine learning assumptions. Our framework is mainly composed of two operators: \emph{resilient averaging} at the server and \emph{distributed momentum} at the workers. We prove a general theorem stating the convergence of distributed SGD under RESAM. Interestingly, demonstrating and comparing the convergence of many existing techniques become direct corollaries of our theorem, without resorting to stringent assumptions. We also present an empirical evaluation of the practical relevance of RESAM.
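    The two operators compose naturally; a minimal sketch of one admissible instantiation (the coordinate-wise median as the resilient aggregator — the framework admits others) might look as follows.

```python
import numpy as np

def update_momentum(m, grad, beta=0.9):
    # Worker side: distributed momentum over local stochastic gradients.
    return beta * m + (1.0 - beta) * grad

def resilient_average(momentums):
    # Server side: aggregate the workers' momentums robustly; the
    # coordinate-wise median is one standard resilient choice.
    return np.median(np.stack(momentums), axis=0)
```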
    Learning for Expressive Task-Related Sentence Representations. (arXiv:2205.12186v1 [cs.CL])
    NLP models learn sentence representations for downstream tasks by tuning a model which is pre-trained by masked language modeling. However, after tuning, the learned sentence representations may be skewed heavily toward label space and thus are not expressive enough to represent whole samples, which should contain task-related information of both sentence inputs and labels. In this work, we learn expressive sentence representations for supervised tasks which (1) contain task-related information in the sentence inputs, and (2) enable correct label predictions. To achieve this goal, we first propose a new objective which explicitly points out the label token space in the input, and predicts categories of labels via an added [MASK] token. This objective encourages fusing the semantic information of both the label and sentence. Then we develop a neighbor attention module, added on a frozen pre-trained model, to build connections between label/sentence tokens via their neighbors. The propagation can be further guided by the regularization on neighborhood representations to encourage expressiveness. Experimental results show that, despite tuning only 5% additional parameters over a frozen pre-trained model, our model can achieve classification results comparable to the SOTA while maintaining strong expressiveness as well.
    Compression-aware Training of Neural Networks using Frank-Wolfe. (arXiv:2205.11921v1 [cs.LG])
    Many existing Neural Network pruning approaches either rely on retraining to compensate for pruning-caused performance degradation or they induce strong biases to converge to a specific sparse solution throughout training. A third paradigm obtains a wide range of compression ratios from a single dense training run while also avoiding retraining. Recent work of Pokutta et al. (2020) and Miao et al. (2022) suggests that the Stochastic Frank-Wolfe (SFW) algorithm is particularly suited for training state-of-the-art models that are robust to compression. We propose leveraging $k$-support norm ball constraints and demonstrate significant improvements over the results of Miao et al. (2022) in the case of unstructured pruning. We also extend these ideas to the structured pruning domain and propose novel approaches to both ensure robustness to the pruning of convolutional filters as well as to low-rank tensor decompositions of convolutional layers. In the latter case, our approach performs on-par with nuclear-norm regularization baselines while requiring only half of the computational resources. Our findings also indicate that the robustness of SFW-trained models largely depends on the gradient rescaling of the learning rate and we establish a theoretical foundation for that practice.
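    The key primitive is the linear minimization oracle (LMO) of the $k$-support norm ball, whose extreme points are $k$-sparse vectors of fixed Euclidean norm. A sketch, under our own conventions (radius `tau`, plain step size `lr`), is below; the paper's training recipe additionally involves choices such as learning-rate gradient rescaling.

```python
import numpy as np

def lmo_ksupport(grad, k, tau):
    # Extreme points of the k-support ball are k-sparse, L2-norm-tau
    # vectors, so the LMO keeps the k largest-magnitude gradient entries.
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    s = np.zeros_like(flat)
    s[idx] = -flat[idx]
    nrm = np.linalg.norm(s)
    if nrm > 0:
        s *= tau / nrm
    return s.reshape(grad.shape)

def sfw_step(w, grad, k, tau, lr):
    # One Stochastic Frank-Wolfe step: a convex combination with the LMO
    # vertex, so the iterate never leaves the constraint set.
    return (1.0 - lr) * w + lr * lmo_ksupport(grad, k, tau)
```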
    Neur2SP: Neural Two-Stage Stochastic Programming. (arXiv:2205.12006v1 [math.OC])
    Stochastic programming is a powerful modeling framework for decision-making under uncertainty. In this work, we tackle two-stage stochastic programs (2SPs), the most widely applied and studied class of stochastic programming models. Solving 2SPs exactly requires evaluation of an expected value function that is computationally intractable. Additionally, having a mixed-integer linear program (MIP) or a nonlinear program (NLP) in the second stage further aggravates the problem difficulty. In such cases, solving them can be prohibitively expensive even if specialized algorithms that exploit problem structure are employed. Finding high-quality (first-stage) solutions -- without leveraging problem structure -- can be crucial in such settings. We develop Neur2SP, a new method that approximates the expected value function via a neural network to obtain a surrogate model that can be solved more efficiently than the traditional extensive formulation approach. Moreover, Neur2SP makes no assumptions about the problem structure, in particular about the second-stage problem, and can be implemented using an off-the-shelf solver and open-source libraries. Our extensive computational experiments on benchmark 2SP datasets from four problem classes with different structures (containing MIP and NLP second-stage problems) show the efficiency (time) and efficacy (solution quality) of Neur2SP. Specifically, the proposed method takes less than 1.66 seconds across all problems, achieving high-quality solutions even as the number of scenarios increases, an ideal property that is difficult to achieve with traditional 2SP solution techniques. By contrast, the most generic baseline method typically requires minutes to hours to find solutions of comparable quality.
    Deep Low-Density Separation for Semi-Supervised Classification. (arXiv:2205.11995v1 [cs.LG])
    Given a small set of labeled data and a large set of unlabeled data, semi-supervised learning (SSL) attempts to leverage the location of the unlabeled datapoints in order to create a better classifier than could be obtained from supervised methods applied to the labeled training set alone. Effective SSL imposes structural assumptions on the data, e.g. that neighbors are more likely to share a classification or that the decision boundary lies in an area of low density. For complex and high-dimensional data, neural networks can learn feature embeddings to which traditional SSL methods can then be applied in what we call hybrid methods. Previously-developed hybrid methods iterate between refining a latent representation and performing graph-based SSL on this representation. In this paper, we introduce a novel hybrid method that instead applies low-density separation to the embedded features. We describe it in detail and discuss why low-density separation may be better suited for SSL on neural network-based embeddings than graph-based algorithms. We validate our method using in-house customer survey data and compare it to other state-of-the-art learning methods. Our approach effectively classifies thousands of unlabeled users from a relatively small number of hand-classified examples.
    Ensemble Multi-Relational Graph Neural Networks. (arXiv:2205.12076v1 [cs.LG])
    It is well established that graph neural networks (GNNs) can be interpreted and designed from the perspective of an optimization objective. With a clear optimization objective, the deduced GNN architecture has a sound theoretical foundation, which makes it possible to flexibly remedy weaknesses of GNNs. However, this optimization objective has only been proved for GNNs on single-relational graphs. Can we infer a new type of GNN for multi-relational graphs by extending this optimization objective, so as to simultaneously solve the issues in previous multi-relational GNNs, e.g., over-parameterization? In this paper, we propose novel ensemble multi-relational GNNs by designing an ensemble multi-relational (EMR) optimization objective. This EMR optimization objective is able to derive an iterative updating rule, which can be formalized as an ensemble message passing (EnMP) layer with multiple relations. We further analyze the nice properties of the EnMP layer, e.g., its relationship with multi-relational personalized PageRank. Finally, a new multi-relational GNN is proposed that effectively alleviates the over-smoothing and over-parameterization issues. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed model.
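    To make the flavor of such a layer concrete, here is a hedged sketch of an ensemble message-passing step: propagate node features through each (normalized) relation, average across relations, and mix the input features back in — the mixing echoes personalized PageRank. The exact EnMP update derived from the EMR objective differs in its learned weighting; this form is illustrative only.

```python
import numpy as np

def enmp_layer(H, H0, rel_adjs, alpha=0.1):
    # H: current node features; H0: input features; rel_adjs: one
    # normalized adjacency matrix per relation.
    prop = sum(A @ H for A in rel_adjs) / len(rel_adjs)
    return (1.0 - alpha) * prop + alpha * H0
```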
    Faithful Explanations for Deep Graph Models. (arXiv:2205.11850v1 [cs.LG])
    This paper studies faithful explanations for Graph Neural Networks (GNNs). First, we provide a new and general method for formally characterizing the faithfulness of explanations for GNNs. It applies to existing explanation methods, including feature attributions and subgraph explanations. Second, our analytical and empirical results demonstrate that feature attribution methods cannot capture the nonlinear effect of edge features, while existing subgraph explanation methods are not faithful. Third, we introduce \emph{k-hop Explanation with a Convolutional Core} (KEC), a new explanation method that provably maximizes faithfulness to the original GNN by leveraging information about the graph structure in its adjacency matrix and its \emph{k-th} power. Lastly, our empirical results over both synthetic and real-world datasets for classification and anomaly detection tasks with GNNs demonstrate the effectiveness of our approach.
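    The role of the adjacency power is standard: the $(i,j)$ entry of $A^k$ counts length-$k$ walks, so its support is exactly the $k$-hop neighborhood the explanation can draw on. A two-line helper makes this explicit.

```python
import numpy as np

def k_hop_support(A, k):
    # Nonzero pattern of A^k = pairs connected by a walk of length k.
    return (np.linalg.matrix_power(A, k) > 0).astype(int)
```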
    Quadratic models for understanding neural network dynamics. (arXiv:2205.11787v1 [cs.LG])
    In this work, we propose using a quadratic model as a tool for understanding properties of wide neural networks in both optimization and generalization. We show analytically that certain deep learning phenomena such as the "catapult phase" from [Lewkowycz et al. 2020], which cannot be captured by linear models, are manifested in the quadratic model for shallow ReLU networks. Furthermore, our empirical results indicate that the behaviour of quadratic models parallels that of neural networks in generalization, especially in the large learning rate regime. We expect that quadratic models will serve as a useful tool for analysis of neural networks.
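    A toy two-parameter example shows what a quadratic model captures that a linearized one cannot. For $f(w) = w_1\,\mathrm{relu}(w_2 x)$ on the active ReLU region, $f$ is bilinear in $(w_1, w_2)$, so the second-order model is exact while the linear (NTK-style) model is not; the function and numbers below are ours, purely for illustration.

```python
import numpy as np

x = 1.5
f = lambda w: w[0] * max(w[1] * x, 0.0)   # tiny "shallow ReLU network"

w0 = np.array([0.8, 0.6])                 # relu is active: w0[1] * x > 0
g = np.array([w0[1] * x, w0[0] * x])      # gradient at w0
H = np.array([[0.0, x], [x, 0.0]])        # Hessian at w0

dw = np.array([0.5, 0.4])                 # a large-ish weight update
f_lin = f(w0) + g @ dw                    # linearized model
f_quad = f_lin + 0.5 * dw @ H @ dw        # quadratic model
print(f(w0 + dw), f_lin, f_quad)          # 1.95, 1.65, 1.95: quadratic exact
```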
    Concurrent Credit Assignment for Data-efficient Reinforcement Learning. (arXiv:2205.12020v1 [cs.LG])
    The capability to widely sample the state and action spaces is a key ingredient of effective reinforcement learning algorithms. The variational optimization principles exposed in this paper emphasize the importance of an occupancy model that synthesizes the general distribution of the agent's environmental states over which it can act (defining a virtual ``territory''). The occupancy model is updated frequently as exploration progresses and new states are discovered during training. By making a uniform prior assumption, the resulting objective expresses a balance between two concurrent tendencies, namely the widening of the occupancy space and the maximization of the rewards, reminiscent of the classical exploration/exploitation trade-off. Implemented within an off-policy actor-critic on classic continuous-action benchmarks, the approach is shown to provide a significant increase in sampling efficacy, reflected in reduced training time and higher returns, in both the dense- and sparse-reward cases.
    Quantum Kerr Learning. (arXiv:2205.12004v1 [quant-ph])
    Quantum machine learning is a rapidly evolving area that could facilitate important applications for quantum computing and significantly impact data science. In our work, we argue, for various reasons drawn from complexity theory and physics, that a single Kerr mode might provide extra quantum enhancements when used with quantum kernel methods. Furthermore, we establish an experimental protocol, which we call \emph{quantum Kerr learning}, based on circuit QED. A detailed study using the kernel method, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations shows that quantum enhancements could happen in terms of the convergence time and the generalization error, while explicit protocols are also constructed for higher-dimensional input data.
    Pynblint: a Static Analyzer for Python Jupyter Notebooks. (arXiv:2205.11934v1 [cs.SE])
    Jupyter Notebook is the tool of choice of many data scientists in the early stages of ML workflows. The notebook format, however, has been criticized for inducing bad programming practices; indeed, researchers have already shown that open-source repositories are inundated by poor-quality notebooks. Low-quality output from the prototypical stages of ML workflows constitutes a clear bottleneck towards the productization of ML models. To foster the creation of better notebooks, we developed Pynblint, a static analyzer for Jupyter notebooks written in Python. The tool checks the compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices and provides targeted recommendations when violations are detected.
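    We cannot reproduce Pynblint's actual rule set or API here, but a toy check in the same spirit — flagging notebooks whose code cells were not executed top-to-bottom, a frequently cited notebook smell — fits in a few lines using the standard nbformat library.

```python
import nbformat

def executed_in_order(path):
    # Returns True if the notebook's code cells were run top-to-bottom.
    nb = nbformat.read(path, as_version=4)
    counts = [c.execution_count for c in nb.cells
              if c.cell_type == "code" and c.execution_count is not None]
    return counts == sorted(counts)
```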
    Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture. (arXiv:2205.11786v1 [cs.LG])
    In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
    Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free. (arXiv:2205.11819v1 [cs.LG])
    Trojan attacks threaten deep neural networks (DNNs) by poisoning them to behave normally on most samples, yet to produce manipulated results for inputs attached with a particular trigger. Several works attempt to detect whether a given DNN has been injected with a specific trigger during the training. In a parallel line of research, the lottery ticket hypothesis reveals the existence of sparse subnetworks which are capable of reaching competitive performance as the dense network after independent training. Connecting these two dots, we investigate the problem of Trojan DNN detection from the brand new lens of sparsity, even when no clean training data is available. Our crucial observation is that the Trojan features are significantly more stable to network pruning than benign features. Leveraging that, we propose a novel Trojan network detection regime: first locating a "winning Trojan lottery ticket" which preserves nearly full Trojan information yet only chance-level performance on clean inputs; then recovering the trigger embedded in this already isolated subnetwork. Extensive experiments on various datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, with different network architectures, i.e., VGG-16, ResNet-18, ResNet-20s, and DenseNet-100 demonstrate the effectiveness of our proposal. Codes are available at https://github.com/VITA-Group/Backdoor-LTH.
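    The pruning primitive underlying any "winning ticket" search is global magnitude pruning; a hedged sketch is below. This is only the first ingredient — the paper's contribution is how the surviving subnetwork isolates Trojan features and enables trigger recovery.

```python
import torch

def global_magnitude_mask(model, sparsity=0.9):
    # Keep the largest-magnitude weights globally, zeroing a `sparsity`
    # fraction; applying the masks yields the candidate subnetwork.
    scores = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(sparsity * scores.numel()))
    threshold = torch.kthvalue(scores, k).values
    return [(p.detach().abs() > threshold).float() for p in model.parameters()]
```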
    Functional Network: A Novel Framework for Interpretability of Deep Neural Networks. (arXiv:2205.11702v1 [cs.LG])
    The layered structure of deep neural networks hinders the use of numerous analysis tools and thus the development of its interpretability. Inspired by the success of functional brain networks, we propose a novel framework for interpretability of deep neural networks, that is, the functional network. We construct the functional network of fully connected networks and explore its small-worldness. In our experiments, the mechanisms of regularization methods, namely, batch normalization and dropout, are revealed using graph theoretical analysis and topological data analysis. Our empirical analysis shows the following: (1) Batch normalization enhances model performance by increasing the global efficiency and the number of loops but reduces adversarial robustness by lowering the fault tolerance. (2) Dropout improves generalization and robustness of models by improving the functional specialization and fault tolerance. (3) The models with different regularizations can be clustered correctly according to their functional topological differences, reflecting the great potential of the functional network and topological data analysis in interpretability.
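    A minimal construction of such a functional network, under our own simplification, thresholds the correlations of unit activations over a probe set; graph statistics such as efficiency or small-worldness can then be computed on the result.

```python
import numpy as np

def functional_graph(acts, threshold=0.5):
    # acts: (samples, units) activations. Nodes are units; an edge joins
    # units whose absolute activation correlation exceeds the threshold.
    C = np.corrcoef(acts.T)
    A = (np.abs(C) > threshold).astype(int)
    np.fill_diagonal(A, 0)
    return A
```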
    Soft-SVM Regression For Binary Classification. (arXiv:2205.11735v1 [stat.ML])
    The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.
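    One standard way to interpolate between the two losses — shown here only as an illustration of the softness idea, not as the paper's exact Soft-SVM family — is a softplus smoothing of the hinge, which approaches the SVM hinge as the softness parameter grows.

```python
import numpy as np

def soft_hinge(margin, s=1.0):
    # (1/s) * log(1 + exp(s * (1 - margin))): tends to max(0, 1 - margin)
    # as s -> infinity and to a logistic-like loss for small s.
    return np.logaddexp(0.0, s * (1.0 - margin)) / s
```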
    RCC-GAN: Regularized Compound Conditional GAN for Large-Scale Tabular Data Synthesis. (arXiv:2205.11693v1 [cs.LG])
    This paper introduces a novel generative adversarial network (GAN) for synthesizing large-scale tabular databases which contain various feature types, such as continuous, discrete, and binary. Technically, our GAN belongs to the category of class-conditioned generative models with a predefined conditional vector. However, we propose a new formulation for deriving such a vector incorporating both binary and discrete features simultaneously. We refer to this novel definition as the compound conditional vector and employ it for training the generator network. The core architecture of this network is a three-layered deep residual neural network with skip connections. For improving the stability of such a complex architecture, we present a regularization scheme that limits abrupt variations of its weight vectors during training. This regularization approach is quite compatible with the nature of adversarial training and it is not computationally prohibitive in runtime. Furthermore, we constantly monitor the variation of the weight vectors to identify any potential instabilities or irregularities and to measure the strength of our proposed regularizer. Toward this end, we also develop a new metric for tracking sudden perturbations of the weight vectors using singular value decomposition theory. Finally, we evaluate the performance of our proposed synthesis approach on six benchmarking tabular databases, namely Adult, Census, HCDR, Cabs, News, and King. The achieved results corroborate that for the majority of the cases, our proposed RccGAN outperforms other conventional and modern generative models in terms of accuracy, stability, and reliability.
    Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees. (arXiv:2205.11765v1 [cs.LG])
    We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.
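    Bucketing itself is simple to state: randomly group the workers' gradients, average within each bucket, and hand the bucket means to a robust aggregator. The sketch below uses a coordinate-wise median as that aggregator; this is our illustrative choice, not necessarily the paper's.

```python
import numpy as np

def bucketed_aggregate(grads, bucket_size=2, rng=None):
    # grads: list of equally-shaped worker gradients.
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(grads))
    means = [np.mean([grads[i] for i in idx[j:j + bucket_size]], axis=0)
             for j in range(0, len(grads), bucket_size)]
    return np.median(np.stack(means), axis=0)
```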
    Towards a Defense against Backdoor Attacks in Continual Federated Learning. (arXiv:2205.11736v1 [cs.LG])
    Backdoor attacks are a major concern in federated learning (FL) pipelines where training data is sourced from untrusted clients over long periods of time (i.e., continual learning). Preventing such attacks is difficult because defenders in FL do not have access to raw training data. Moreover, in a phenomenon we call backdoor leakage, models trained continuously eventually suffer from backdoors due to cumulative errors in backdoor defense mechanisms. We propose a novel framework for defending against backdoor attacks in the federated continual learning setting. Our framework trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines recent ideas from robust covariance estimation-based filters with early-stopping to control the attack success rate even as the data distribution changes. We provide theoretical motivation for this design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.
    MOSPAT: AutoML based Model Selection and Parameter Tuning for Time Series Anomaly Detection. (arXiv:2205.11755v1 [cs.LG])
    Organizations leverage anomaly and changepoint detection algorithms to detect changes in user behavior or service availability and performance. Many off-the-shelf detection algorithms, though effective, cannot readily be used in large organizations where thousands of users monitor millions of use cases and metrics with varied time series characteristics and anomaly patterns. The selection of algorithm and parameters needs to be precise for each use case: manual tuning does not scale, and automated tuning requires ground truth, which is rarely available. In this paper, we explore MOSPAT, an end-to-end automated machine learning based approach for model and parameter selection, combined with a generative model to produce labeled data. Our scalable end-to-end system allows individual users in large organizations to tailor time-series monitoring to their specific use case and data characteristics, without expert knowledge of anomaly detection algorithms or laborious manual labeling. Our extensive experiments on real and synthetic data demonstrate that this method consistently outperforms using any single algorithm.
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold. (arXiv:2205.11677v1 [stat.ML])
    The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation will disappear completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to stochastic models of networks and to semidefinite programming research.
    PCA-Boosted Autoencoders for Nonlinear Dimensionality Reduction in Low Data Regimes. (arXiv:2205.11673v1 [cs.LG])
    Autoencoders (AE) provide a useful method for nonlinear dimensionality reduction but are ill-suited for low data regimes. Conversely, Principal Component Analysis (PCA) is data-efficient but is limited to linear dimensionality reduction, posing a problem when data exhibits inherent nonlinearity. This presents a challenge in various scientific and engineering domains such as the nanophotonic component design, where data exhibits nonlinear features while being expensive to obtain due to costly real measurements or resource-consuming solutions of partial differential equations. To address this difficulty, we propose a technique that harnesses the best of both worlds: an autoencoder that leverages PCA to perform well on scarce nonlinear data. Specifically, we outline a numerically robust PCA-based initialization of AE, which, together with the parameterized ReLU activation function, allows the training process to start from an exact PCA solution and improve upon it. A synthetic example is presented first to study the effects of data nonlinearity and size on the performance of the proposed method. We then evaluate our method on several nanophotonic component design problems where obtaining useful data is expensive. To demonstrate universality, we also apply it to tasks in other scientific domains: a benchmark breast cancer dataset and a gene expression dataset. We show that our proposed approach is substantially better than both PCA and randomly initialized AE in the majority of low-data regime cases we consider, or at least is comparable to the best of either of the other two methods.
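    The initialization at an exact PCA solution is easy to sketch for the linear core of the autoencoder: take the top-$k$ principal directions as tied encoder/decoder weights and center with the data mean. With the parameterized ReLU started at the identity, training begins exactly at the PCA reconstruction; details beyond this sketch (and all names) are ours.

```python
import numpy as np

def pca_init(X, k):
    # Returns encode/decode maps matching the rank-k PCA reconstruction.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k]                         # top-k principal directions
    encode = lambda x: (x - mu) @ W.T  # initial encoder weights
    decode = lambda z: z @ W + mu      # initial (tied) decoder weights
    return encode, decode
```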
    Machine Learning for Electricity Market Clearing. (arXiv:2205.11641v1 [eess.SY])
    This paper seeks to design a machine learning twin of the optimal power flow (OPF) optimization, which is used in market-clearing procedures by wholesale electricity markets. The motivation for the proposed approach stems from the need to obtain the digital twin, which is much faster than the original, while also being sufficiently accurate and producing consistent generation dispatches and locational marginal prices (LMPs), which are primal and dual solutions of the OPF optimization, respectively. Availability of market-clearing tools based on this approach will enable computationally tractable evaluation of multiple dispatch scenarios under a given unit commitment. Rather than direct solution of OPF, the Karush-Kuhn-Tucker (KKT) conditions for the OPF problem in question may be written, and in parallel the LMPs of generators and loads may be expressed in terms of the OPF Lagrangian multipliers. Also, taking advantage of the practical fact that many of the Lagrangian multipliers associated with lines will be zero (thermal limits are not binding), we build and train an ML scheme which maps flexible resources (loads and renewables) to the binding lines, and supplement it with an efficient power-grid aware linear map to optimal dispatch and LMPs. The scheme is validated and illustrated on IEEE models. We also report a trade-off analysis between the quality of the reconstruction and the number of samples needed to train the model.
    FedSA: Accelerating Intrusion Detection in Collaborative Environments with Federated Simulated Annealing. (arXiv:2205.11519v1 [cs.CR])
    Fast identification of new network attack patterns is crucial for improving network security. Nevertheless, identifying an ongoing attack in a heterogeneous network is a non-trivial task. Federated learning emerges as a solution to collaborative training for an Intrusion Detection System (IDS). The federated learning-based IDS trains a global model using local machine learning models provided by federated participants without sharing local data. However, optimization challenges are intrinsic to federated learning. This paper proposes the Federated Simulated Annealing (FedSA) metaheuristic to select the hyperparameters and a subset of participants for each aggregation round in federated learning. FedSA optimizes hyperparameters linked to the global model convergence. The proposal reduces aggregation rounds and speeds up convergence. Thus, FedSA accelerates learning extraction from local models, requiring fewer IDS updates. The proposal assessment shows that the FedSA global model converges in less than ten communication rounds. The proposal requires up to 50% fewer aggregation rounds to achieve approximately 97% accuracy in attack detection than the conventional aggregation approach.
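    At the heart of the metaheuristic is the Metropolis acceptance rule of simulated annealing, sketched below in isolation: improvements are always kept, while worse hyperparameter/participant configurations are occasionally accepted early on, with probability decaying as the temperature cools.

```python
import math
import random

def sa_accept(candidate_loss, current_loss, temperature):
    # Accept all improvements; accept worse candidates with a probability
    # that shrinks as the temperature decreases.
    if candidate_loss <= current_loss:
        return True
    return random.random() < math.exp((current_loss - candidate_loss) / temperature)
```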
    Forecasting of Non-Stationary Sales Time Series Using Deep Learning. (arXiv:2205.11636v1 [cs.LG])
    The paper describes a deep learning approach for forecasting non-stationary time series using a time-trend correction in the neural network model. Along with the layers for predicting sales values, the neural network model includes a subnetwork block that predicts the weight of a time-trend term, which is added to the predicted sales value. The time-trend term is the product of the predicted weight and a normalized time value. The results show that forecasting accuracy can be substantially improved for non-stationary sales with time trends by using the trend-correction block in the deep learning model.
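    The correction itself is a one-liner; the sketch below (names ours) just makes the stated formula explicit — the network's base prediction plus a predicted weight times normalized time.

```python
def corrected_forecast(base_prediction, trend_weight, t, t_max):
    # Predicted sales plus the trend term: weight * normalized time.
    return base_prediction + trend_weight * (t / t_max)
```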
    An interpretation of the final fully connected layer. (arXiv:2205.11908v1 [cs.LG])
    In recent years neural networks have achieved state-of-the-art accuracy for various tasks, but the interpretation of the generated outputs still remains difficult. In this work we attempt to provide a method to understand the learnt weights in the final fully connected layer in image classification models. We motivate our method by drawing a connection between the policy gradient objective in RL and the supervised learning objective. We suggest that the commonly used cross-entropy-based supervised learning objective can be regarded as a special case of the policy gradient objective. Using this insight we propose a method to find the most discriminative and confusing parts of an image. Our method does not make any prior assumption about the neural network architecture and has low computational cost. We apply our method on publicly available pre-trained models and report the generated results.
    Identifying (anti-)skyrmions while they form. (arXiv:2205.11535v1 [cond-mat.str-el])
    We use a Convolutional Neural Network (CNN) to identify the relevant features in the thermodynamical phases of a simulated three-dimensional spin-lattice system with ferromagnetic and Dzyaloshinskii-Moriya (DM) interactions. Such features include (anti-)skyrmions, merons, and helical and ferromagnetic states. We use a multi-label classification framework, which is flexible enough to accommodate states that mix different features and phases. We then train the CNN to predict the features of the final state from snapshots of intermediate states of the simulation. The trained model allows identifying the different phases reliably and early in the formation process. Thus, the CNN can significantly speed up the phase diagram calculations by predicting the final phase before the spin-lattice Monte Carlo sampling has converged. We show the prowess of this approach by generating phase diagrams with significantly shorter simulation times.
    Learning Context-Aware Service Representation for Service Recommendation in Workflow Composition. (arXiv:2205.11771v1 [cs.SE])
    As more and more software services are published on the Internet, it remains a significant challenge to recommend suitable services to facilitate scientific workflow composition. This paper proposes a novel NLP-inspired approach to recommending services throughout a workflow development process, based on incrementally learning latent service representations from workflow provenance. A workflow composition process is formalized as a step-wise, context-aware service generation procedure, which is mapped to next-word prediction in a natural language sentence. Historical service dependencies are extracted from workflow provenance to build and enrich a knowledge graph. Each path in the knowledge graph reflects a scenario in a data analytics experiment, which is analogous to a sentence in a conversation. All paths are thus formalized as composable service sequences and are mined, using various patterns, from the established knowledge graph to construct a corpus. Service embeddings are then learned by applying a deep learning model from the NLP field. Extensive experiments on the real-world dataset demonstrate the effectiveness and efficiency of the approach.
    G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. (arXiv:2205.11796v1 [cs.CV])
    Arbitrary-oriented object representations include the oriented bounding box (OBB), quadrilateral bounding box (QBB), and point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as the boundary discontinuity, square-like problem, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of breaking this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep to construct Gaussian distributions for OBB, QBB, and PointSet, which achieves a unified solution to various representations and problems. Specifically, PointSet or QBB-based objects are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results on several publicly available datasets, DOTA, HRSC2016, UCAS-AOD, and ICDAR2015 show the excellent performance of the proposed method for arbitrary-oriented object detection. The code has been open sourced at https://github.com/open-mmlab/mmrotate.
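    The conversion at the core of this idea is, in spirit, Gaussian fitting: the maximum likelihood Gaussian for a point set is just its sample mean (the object's center) and covariance (its extent and orientation). A minimal sketch:

```python
import numpy as np

def pointset_to_gaussian(points):
    # points: (N, 2) array. MLE for a Gaussian: sample mean and covariance.
    return points.mean(axis=0), np.cov(points, rowvar=False)
```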
    Identifying Patient-Specific Root Causes of Disease. (arXiv:2205.11627v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.
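    For the linear-SEM case the error-extraction step has a closed form: with $x = Bx + e$ and $B$ the (acyclic) coefficient matrix, the exogenous errors are $e = (I - B)x$. The sketch below assumes $B$ is known; in practice it is estimated, and the method then attributes the diagnostic label to each error via Shapley values.

```python
import numpy as np

def sem_errors(X, B):
    # X: (n, d) samples (rows); B: (d, d) linear SEM coefficient matrix.
    d = B.shape[0]
    return X @ (np.eye(d) - B).T
```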
  • Open

    Forecasting Multilinear Data via Transform-Based Tensor Autoregression. (arXiv:2205.12201v1 [cs.LG])
    In the era of big data, there is an increasing demand for new methods for analyzing and forecasting 2-dimensional data. The current research aims to accomplish these goals through the combination of time-series modeling and multilinear algebraic systems. We expand previous autoregressive techniques to forecast multilinear data, aptly named the L-Transform Tensor autoregressive (L-TAR for short). Tensor decompositions and multilinear tensor products have allowed for this approach to be a feasible method of forecasting. We achieve statistical independence between the columns of the observations through invertible discrete linear transforms, enabling a divide and conquer approach. We present an experimental validation of the proposed methods on datasets containing image collections, video sequences, sea surface temperature measurements, stock prices, and networks.
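    A stripped-down version of the transform-then-forecast idea (our simplification: the FFT as the invertible transform and an entrywise AR(1) in the transform domain, rather than the paper's full L-TAR estimator) fits in a few lines.

```python
import numpy as np

def tar_forecast(Xs, steps=1):
    # Xs: list of (m, n) observation matrices ordered in time.
    Z = np.fft.fft(np.stack(Xs), axis=2)        # transform along columns
    num = (Z[1:] * np.conj(Z[:-1])).sum(axis=0)
    den = (Z[:-1] * np.conj(Z[:-1])).sum(axis=0)
    phi = num / np.where(den == 0, 1.0, den)    # per-entry AR(1) coefficient
    z = Z[-1]
    for _ in range(steps):                      # roll the AR forward
        z = phi * z
    return np.fft.ifft(z, axis=1).real          # back to the data domain
```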
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v2 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. (arXiv:2202.11559v2 [physics.comp-ph] UPDATED)
    Parameter reconstructions are indispensable in metrology. Here, the objective is to explain $K$ experimental measurements by fitting to them a parameterized model of the measurement process. The model parameters are regularly determined by least-square methods, i.e., by minimizing the sum of the squared residuals between the $K$ model predictions and the $K$ experimental observations, $\chi^2$. The model functions often involve computationally demanding numerical simulations. Bayesian optimization methods are specifically suited for minimizing expensive model functions. However, in contrast to least-square methods such as the Levenberg-Marquardt algorithm, they only take the value of $\chi^2$ into account, and neglect the $K$ individual model outputs. We present a Bayesian target-vector optimization scheme with improved performance over previous developments, that considers all $K$ contributions of the model function and that is specifically suited for parameter reconstruction problems which are often based on hundreds of observations. Its performance is compared to established methods for an optical metrology reconstruction problem and two synthetic least-squares problems. The proposed method outperforms established optimization methods. It also enables determining accurate uncertainty estimates with very few observations of the actual model function by using Markov chain Monte Carlo sampling on a trained surrogate model.
    SepIt Approaching a Single Channel Speech Separation Bound. (arXiv:2205.11801v1 [eess.AS])
    We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while the recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the different speakers' estimation. At test time, SepIt has a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
    One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks. (arXiv:2205.12141v1 [cs.LG])
    Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Error-minimizing noise, which is injected to clean data, is one of the most successful methods for preventing DNNs from giving correct predictions on incoming new data. Nonetheless, under specific training strategies such as adversarial training, the unlearnability of error-minimizing noise will severely degrade. In addition, the transferability of error-minimizing noise is inherently limited by the mismatch between the generator model and the targeted learner model. In this paper, we investigate the mechanism of unlearnable examples and propose a novel model-free method, named \emph{One-Pixel Shortcut}, which only perturbs a single pixel of each image and makes the dataset unlearnable. Our method needs much less computational cost and obtains stronger transferability and thus can protect data from a wide range of different models. Based on this, we further introduce the first unlearnable dataset called CIFAR-10-S, which is indistinguishable from normal CIFAR-10 by human observers and can serve as a benchmark for different models or training strategies to evaluate their abilities to extract critical features from the disturbance of non-semantic representations. The original error-minimizing ULEs will lose efficiency under adversarial training, where the model can get over 83\% clean test accuracy. Meanwhile, even if adversarial training and strong data augmentation like RandAugment are applied together, the model trained on CIFAR-10-S cannot get over 50\% clean test accuracy.
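    The perturbation itself is strikingly simple to express: stamp one fixed pixel position and value per class onto every image of that class. The sketch below (single-channel images, illustrative names) shows the mechanics; the paper's contribution lies in finding positions and values that act as shortcuts while remaining imperceptible.

```python
import numpy as np

def one_pixel_shortcut(images, labels, pixel_map):
    # images: (N, H, W); pixel_map: class label -> ((row, col), value).
    out = images.copy()
    for i, y in enumerate(labels):
        (r, c), v = pixel_map[y]
        out[i, r, c] = v
    return out
```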
    Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey. (arXiv:2101.00734v2 [stat.ML] UPDATED)
    This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of distribution of latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Owing to their stochastic and generative behaviour, these models can also be used to generate new data points in the data space. In this paper, we first start with variational inference where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained, as a special case of factor analysis, and its closed-form solutions are derived. Finally, VAE is explained where the encoder, decoder and sampling from the latent space are introduced. Training VAE using both EM and backpropagation is explained.
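    As a taste of the closed-form results such a tutorial derives, probabilistic PCA's maximum likelihood solution (Tipping and Bishop) can be written directly from the sample covariance eigendecomposition; the sketch below takes the arbitrary rotation matrix to be the identity.

```python
import numpy as np

def ppca_ml(X, k):
    # W = U_k (L_k - sigma^2 I)^{1/2}; sigma^2 = mean discarded eigenvalue.
    S = np.cov(X - X.mean(axis=0), rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    vals, vecs = vals[::-1], vecs[:, ::-1]          # descending order
    sigma2 = vals[k:].mean()
    W = vecs[:, :k] * np.sqrt(np.maximum(vals[:k] - sigma2, 0.0))
    return W, sigma2
```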
    On the identifiability of mixtures of ranking models. (arXiv:2201.13132v2 [cs.LG] UPDATED)
    Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (BTL, multinomial logistic models with slates of size 3, or Plackett-Luce) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models to establish generic identifiability. The framework can be applied more broadly to other learning models and may be of independent interest.
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v2 [cs.LG] UPDATED)
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
    Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable. (arXiv:2205.11716v1 [cs.LG])
    Recently, neural networks have been shown to perform exceptionally well in transforming two arbitrary sets into two linearly separable sets. Doing this with a randomly initialized neural network is of immense interest because the associated computation is cheaper than using fully trained networks. In this paper, we show that, with sufficient width, a randomly initialized one-layer neural network transforms two sets into two linearly separable sets with high probability. Furthermore, we provide explicit bounds on the required width of the neural network for this to occur. Our first bound is exponential in the input dimension and polynomial in all other parameters, while our second bound is independent of the input dimension, thereby overcoming the curse of dimensionality. We also perform an experimental study comparing the separation capacity of randomly initialized one-layer and two-layer neural networks. With correctly chosen biases, our study shows that, for low-dimensional data, the two-layer neural network outperforms the one-layer network. However, the opposite is observed for higher-dimensional data.
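    The phenomenon is easy to check empirically on a toy problem: two concentric circles are not linearly separable in the plane, but after a wide random (untrained) ReLU layer a linear classifier separates them. The experiment below is our own quick illustration, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, width = 200, 2, 2000

angles = rng.uniform(0.0, 2.0 * np.pi, n)
radii = np.where(rng.random(n) < 0.5, 1.0, 2.0)   # two concentric circles
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = (radii > 1.5).astype(int)

W = rng.standard_normal((d, width))
b = rng.standard_normal(width)
H = np.maximum(X @ W + b, 0.0)     # random one-layer ReLU features

clf = LogisticRegression(C=1e6, max_iter=10000).fit(H, y)
print(clf.score(H, y))             # typically 1.0: the sets became separable
```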
    Weak Convergence of Approximate reflection coupling and its Application to Non-convex Optimization. (arXiv:2205.11970v1 [math.PR])
    In this paper, we propose a weak approximation of the reflection coupling (RC) for stochastic differential equations (SDEs), and prove that it converges weakly to the desired coupling. In contrast to the RC, the proposed approximate reflection coupling (ARC) need not take the hitting time of processes to the diagonal set into consideration and can be defined as the solution of some SDEs on the whole time interval. Therefore, ARC can work effectively against SDEs with different drift terms. As an application of ARC, an evaluation of the effectiveness of stochastic gradient descent in a non-convex setting is also described. For the sample size $n$, the step size $\eta$, and the batch size $B$, we derive evaluations, uniform in time, of orders $n^{-1}$, $\eta^{1/2}$, and $\sqrt{(n - B) / B (n - 1)}$, respectively.
    Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs. (arXiv:2205.11894v1 [cs.LG])
    We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. We introduce a new model that decomposes independent dynamics of single objects accurately from their interactions. By employing latent Gaussian process ordinary differential equations, our model infers both independent dynamics and their interactions with reliable uncertainty estimates. In our formulation, each object is represented as a graph node and interactions are modeled by accumulating the messages coming from neighboring objects. We show that efficient inference of such a complex network of variables is possible with modern variational sparse Gaussian process inference techniques. We empirically demonstrate that our model improves the reliability of long-term predictions over neural network based alternatives and it successfully handles missing dynamic or static information. Furthermore, we observe that only our model can successfully encapsulate independent dynamics and interaction information in distinct functions and show the benefit from this disentanglement in extrapolation scenarios.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v1 [cs.LG])
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
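    The FrozenHopfield association reduces to a retrieval step that can be sketched compactly: a random but fixed projection turns the observation into a query, and a softmax over similarities with the stored (frozen) token embeddings returns their convex combination. Shapes and names below are ours.

```python
import torch

def frozen_hopfield_retrieve(obs, proj, token_embeddings, beta=1.0):
    # obs: (o,) observation; proj: (o, d) fixed random projection;
    # token_embeddings: (V, d) frozen vocabulary embeddings.
    query = obs @ proj
    scores = beta * token_embeddings @ query
    return torch.softmax(scores, dim=0) @ token_embeddings
```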
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v5 [cs.LG] UPDATED)
    The use of pessimism, when reasoning about datasets lacking exhaustive exploration, has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. (arXiv:2007.10731v2 [stat.CO] UPDATED)
    A novel multi-task Gaussian process (GP) framework is proposed, by using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective to improve multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is derived for handling both hyper-parameters optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performances, even far from observations, and may reduce significantly the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
    Quasi-Equivalence of Width and Depth of Neural Networks. (arXiv:2002.02515v7 [cs.LG] UPDATED)
    While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network respectively for either regression or classification so that the essentially same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
    Nonnegative Tensor Completion via Integer Optimization. (arXiv:2111.04580v2 [cs.LG] UPDATED)
    Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v2 [cs.LG] UPDATED)
    This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v3 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
    Logarithmic regret bounds for continuous-time average-reward Markov decision processes. (arXiv:2205.11168v2 [cs.LG] UPDATED)
    We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
    Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits. (arXiv:2002.00287v3 [cs.LG] UPDATED)
    We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm is allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d.~at random from a known distribution, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best available bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results for this problem setting.
    Not too little, not too much: a theoretical analysis of graph (over)smoothing. (arXiv:2205.12156v1 [stat.ML])
    We analyze graph smoothing with \emph{mean aggregation}, where each node successively receives the average of the features of its neighbors. Indeed, it has quickly been observed that Graph Neural Networks (GNNs), which generally follow some variant of Message-Passing (MP) with repeated aggregation, may be subject to the \emph{oversmoothing} phenomenon: by performing too many rounds of MP, the node features tend to converge to a non-informative limit. In the case of mean aggregation, for connected graphs, the node features become constant across the whole graph. At the other end of the spectrum, it is intuitively obvious that \emph{some} MP rounds are necessary, but existing analyses do not exhibit both phenomena at once: beneficial ``finite'' smoothing and oversmoothing in the limit. In this paper, we consider simplified linear GNNs, and rigorously analyze two examples for which a finite number of mean aggregation steps provably improves the learning performance, before oversmoothing kicks in. We consider a latent space random graph model, where node features are partial observations of the latent variables and the graph contains pairwise relationships between them. We show that graph smoothing restores some of the lost information, up to a certain point, via two phenomena: graph smoothing shrinks non-principal directions in the data faster than principal ones, which is useful for regression, and pulls nodes within a community together faster than it collapses distinct communities, which improves classification.
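    For concreteness, a minimal sketch of the mean-aggregation operator analyzed here (assuming a dense adjacency matrix):

    ```python
    import numpy as np

    def mean_aggregate(A, X, n_rounds):
        # Each round replaces a node's features with the average of its
        # neighbors' features: X <- D^{-1} A X. On a connected graph, too
        # many rounds drive X toward a constant per column (oversmoothing).
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
        for _ in range(n_rounds):
            X = (A @ X) / deg
        return X
    ```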
    Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. (arXiv:2205.12184v1 [cs.LG])
    Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for It\^o diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
    Stereographic Markov Chain Monte Carlo. (arXiv:2205.12112v1 [stat.CO])
    High dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves results in empirically observed "stickiness" and poor theoretical mixing properties -- lack of geometric ergodicity. In this paper, we introduce a new class of MCMC samplers that map the original high dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis type algorithms as well as versions of the Bouncy Particle Sampler that are uniformly ergodic for a large class of light and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers can enjoy the ``blessings of dimensionality,'' in that the mixing time decreases with dimension.
    Dimension-agnostic inference using cross U-statistics. (arXiv:2011.05068v4 [math.ST] UPDATED)
    Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference: developing methods whose validity does not depend on any assumption about $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for a handful of classical problems including one-sample mean and covariance testing. Our tests are shown to have minimax rate-optimal power against appropriate local alternatives, and their power is optimal up to a $\sqrt 2$ factor. We end by suggesting some next steps for extending dimension-agnostic inference to other problems.
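    A rough sketch of such a statistic for the one-sample mean problem (sample splitting plus self-normalization; the paper's exact studentization differs in its details):

    ```python
    import numpy as np

    def cross_mean_stat(X):
        # Test H0: E[X] = 0 by splitting the sample in two and keeping only
        # the "off-diagonal" cross inner products between the two halves.
        n = X.shape[0]
        X1, X2 = X[: n // 2], X[n // 2:]
        h = X1 @ X2.T                # cross terms h(x_i, y_j) = <x_i, y_j>
        row = h.mean(axis=1)         # average each x_i against the second half
        # self-normalize: Gaussian limit without assumptions on d versus n
        return np.sqrt(len(row)) * row.mean() / row.std(ddof=1)
    ```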
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold. (arXiv:2205.11677v1 [stat.ML])
    The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, this fundamental limitation disappears completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structure. Our work brings a new perspective to stochastic models of networks and to semidefinite programming research.  ( 2 min )
    Eigenvalue and Generalized Eigenvalue Problems: Tutorial. (arXiv:1903.11240v2 [stat.ML] UPDATED)
    This paper is a tutorial for eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we describe the optimization problems which give rise to eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.  ( 2 min )
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v2 [stat.ML] UPDATED)
    We introduce a computationally efficient, data-driven framework for quantifying the uncertainty in the physical parameters and model formulation of computer models represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models are often imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions, we use Hamiltonian Monte Carlo sampling. We demonstrate our approach in a simulation study with hemodynamic models, which are time-dependent differential equations. Data are simulated from a more complex model than our modelling choice, and the aim is to learn physical parameters according to known mathematical connections. To demonstrate the flexibility of our approach, we also include an example using the heat equation, a space-time-dependent differential equation, in which we consider a biased data-acquisition process. Finally, we fit the hemodynamic model using real data obtained in a medical trial.  ( 2 min )
    Risk-Sensitive Reinforcement Learning via Policy Gradient Search. (arXiv:1810.09126v3 [cs.LG] UPDATED)
    The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, as well as outlining some potential future research directions.  ( 2 min )
    Optimality Conditions and Algorithms for Top-K Arm Identification. (arXiv:2205.12086v1 [stat.ML])
    We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling efforts. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence and fixed-budget settings, as well as the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality, along with the first provably optimal algorithm in the fixed-confidence setting for k>1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared to existing methods in all three settings.  ( 2 min )
    DeepKriging: Spatially Dependent Deep Neural Networks for Spatial Prediction. (arXiv:2007.11972v4 [stat.ML] UPDATED)
    In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions and is often associated with Gaussian processes. However, when considering non-linear prediction for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where the spatial dependence is captured by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and it has multiple advantages over Kriging for non-Gaussian and non-stationary data, i.e., it provides non-linear predictions and thus has smaller approximation errors, it does not require operations on covariance matrices and thus is scalable for large datasets, and with sufficiently many hidden neurons, it provides the optimal prediction in terms of model capacity. We further explore the possibility of quantifying prediction uncertainties based on density prediction without assuming any data distribution. Finally, we apply the method to predicting PM2.5 concentrations across the continental United States.  ( 2 min )
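    For intuition about the architecture described above, a minimal sketch of a coordinate embedding via basis functions (a Gaussian RBF stand-in; the paper's particular basis family may differ):

    ```python
    import numpy as np

    def rbf_embedding(coords, knots, bandwidth=1.0):
        # Map 2-D spatial coordinates onto basis-function features centered
        # at a grid of knots; the DNN then takes the concatenation
        # [covariates, rbf_embedding(coords)] as its input.
        d2 = ((coords[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    ```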
    Quantum Kerr Learning. (arXiv:2205.12004v1 [quant-ph])
    Quantum machine learning is a rapidly evolving area that could facilitate important applications for quantum computing and significantly impact data science. In our work, we argue that a single Kerr mode might provide some extra quantum enhancement when using quantum kernel methods, for various reasons drawn from complexity theory and physics. Furthermore, we establish an experimental protocol, which we call \emph{quantum Kerr learning}, based on circuit QED. A detailed study using the kernel method, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations shows that quantum enhancements could occur in terms of convergence time and generalization error, while explicit protocols are also constructed for higher-dimensional input data.  ( 2 min )
    Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees. (arXiv:2205.11765v1 [cs.LG])
    We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.  ( 2 min )
    Throwing Away Data Improves Worst-Class Error in Imbalanced Classification. (arXiv:2205.11672v1 [stat.ML])
    Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that \emph{more data is better}, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority group in the training data, and finds additional motivation in the robustness, fairness, and out-of-distribution literatures. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match in size the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, \emph{throwing away most data from the majority class leads to better worst-class error}.  ( 2 min )
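    Scheme (ii) is easy to state precisely; a minimal sketch:

    ```python
    import numpy as np

    def subsample_to_minority(X, y, seed=0):
        # Downsample every class to the size of the smallest class, so the
        # training set is balanced at the cost of discarding majority data.
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        idx = np.concatenate([rng.choice(np.where(y == c)[0], n_min, replace=False)
                              for c in classes])
        return X[idx], y[idx]
    ```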
    EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling. (arXiv:2205.12243v1 [stat.ML])
    This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.  ( 2 min )
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v1 [stat.ML])
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models - the Variational Auto-Encoder (VAE). We point out the two generalization gaps that can affect the generalization ability of VAEs and show that the over-fitting phenomenon is usually dominated by the amortized inference network. Based on this observation, we propose a new training objective, inspired by the classic wake-sleep algorithm, to improve the generalization properties of amortized inference. We also demonstrate how it can improve generalization performance in the context of image modeling and lossless compression.  ( 2 min )
    Identifying Patient-Specific Root Causes of Disease. (arXiv:2205.11627v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.  ( 2 min )
    Soft-SVM Regression For Binary Classification. (arXiv:2205.11735v1 [stat.ML])
    The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.  ( 2 min )
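    One generic way to realize the "softness" idea the abstract describes is a temperature-controlled smooth relaxation of the hinge; a hedged sketch (the paper's exact family, including its class-separation parameter, may differ):

    ```python
    import numpy as np

    def soft_hinge(margin, softness=1.0):
        # Smooth relaxation of the hinge max(0, 1 - margin): as softness -> 0
        # it recovers the SVM hinge, while larger softness behaves more like
        # a logistic-type loss, bridging the two regimes.
        return softness * np.logaddexp(0.0, (1.0 - margin) / softness)
    ```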
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v6 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures and topologies, and even outperforms supervised methods.  ( 2 min )
    A Quadrature Rule combining Control Variates and Adaptive Importance Sampling. (arXiv:2205.11890v1 [stat.ML])
    Driven by several successful applications such as in stochastic gradient descent or in Bayesian computation, control variates have become a major tool for Monte Carlo integration. However, standard methods do not allow the distribution of the particles to evolve during the algorithm, as is the case in sequential simulation methods. Within the standard adaptive importance sampling framework, a simple weighted least squares approach is proposed to improve the procedure with control variates. The procedure takes the form of a quadrature rule with adapted quadrature weights to reflect the information brought in by the control variates. The quadrature points and weights do not depend on the integrand, a computational advantage in case of multiple integrands. Moreover, the target density needs to be known only up to a multiplicative constant. Our main result is a non-asymptotic bound on the probabilistic error of the procedure. The bound proves that for improving the estimate's accuracy, the benefits from adaptive importance sampling and control variates can be combined. The good behavior of the method is illustrated empirically on synthetic examples and real-world data for Bayesian linear regression.  ( 2 min )
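    The core regression trick can be sketched as follows, assuming each control variate integrates to zero against the target: regress the importance-weighted integrand values on the weighted control values, and the fitted intercept is the quadrature estimate (a generic sketch; the paper's adapted quadrature weights are more refined):

    ```python
    import numpy as np

    def cv_estimate(wf, wh):
        # wf: importance-weighted integrand values w_i * f(x_i), shape (n,)
        # wh: weighted control-variate values, shape (n, k); each control
        #     integrates to zero, so the OLS intercept estimates the integral
        #     with variance reduced by whatever the controls explain.
        X = np.column_stack([np.ones(len(wf)), wh])
        beta, *_ = np.linalg.lstsq(X, wf, rcond=None)
        return beta[0]
    ```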
    Weakly-supervised Multi-output Regression via Correlated Gaussian Processes. (arXiv:2002.08412v2 [stat.ML] UPDATED)
    Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.  ( 2 min )
    Advanced Manufacturing Configuration by Sample-efficient Batch Bayesian Optimization. (arXiv:2205.11827v1 [cs.LG])
    We propose a framework for the configuration and operation of expensive-to-evaluate advanced manufacturing methods, based on Bayesian optimization. The framework unifies a tailored acquisition function, a parallel acquisition procedure, and the integration of process information providing context to the optimization procedure. The novel acquisition function is demonstrated and analyzed on benchmark illustrative problems. We apply the optimization approach to atmospheric plasma spraying in simulation and experiments. Our results demonstrate that the proposed framework can efficiently find input parameters that produce the desired outcome and minimize the process cost.  ( 2 min )
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v1 [stat.ML])
    Most machine learning methods depend on the tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation, which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection method based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.  ( 2 min )
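    For reference, a minimal sketch of the estimator whose bandwidth is being tuned, Gaussian-kernel ridge regression (the paper's closed-form selection rule itself is not reproduced here):

    ```python
    import numpy as np

    def gaussian_krr(X, y, X_new, bandwidth, ridge=1e-3):
        # Fit alpha = (K + ridge*I)^{-1} y with a Gaussian kernel, then
        # predict at new points; `bandwidth` is the hyper-parameter the
        # paper selects via Jacobian control.
        def kernel(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * bandwidth ** 2))
        alpha = np.linalg.solve(kernel(X, X) + ridge * np.eye(len(X)), y)
        return kernel(X_new, X) @ alpha
    ```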
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v1 [cs.LG])
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms, and (2) we introduce a multi-task-learning-based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data and on non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.  ( 2 min )
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v1 [stat.ML])
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we discuss the Gaussian case extensively, for which the updates take a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.  ( 2 min )

  • Open

    [D] What kind of motherboard do I need to run 2x 3090?
    Hey guys, I'm a designer wanting to dip my toes into machine learning/AI. 1) Is a mATX motherboard with 1x PCIe x16 and 1x PCIe x4 good enough for dual 3090s? 2) Do I absolutely need to use NVLink between these 3090s? What are the pros and cons? submitted by /u/Aeonbreak [link] [comments]  ( 1 min )
    [D] Google Speech to Text vs Building similar capability in house
    Hey all, I hope you're well. TL;DR: I'm trying to compare the costs of building vs using existing technology, for those of you who have experience building speech or signal-processing ML/AI. I'm looking into building a SaaS feature that requires speech-to-text. As an MVP, we've been using Google's Speech-to-Text API, which is great - with a few problems: its cost and accuracy. While its accuracy is quite high, its cost, if I'm calculating correctly, would be quite high relative to our pricing. There are also benefits in terms of building accurate models for the specific industry we'd be using it in. Does anyone have any examples of what it would take to build something like that (costs / number of engineers) and, more importantly, to operate it (say, cost per minute in computing power / storage)? submitted by /u/Mobile_Jacket_894 [link] [comments]  ( 2 min )
    [D] Seeking advice for MLDS beginners
    I have an advanced degree in math, and am familiar with Python (from my college days). When I try tutorials for Sagemaker or Colab, it feels like I make progress, but spend 90% of my time trying to research why something is broken or not working. So, my question - is this the nature of this industry and software platforms at the moment? Are there any other tools/platforms that don't require constant troubleshooting for beginners? Thanks! submitted by /u/evergrowbro [link] [comments]  ( 1 min )
    [D] Training denoising diffusion model when forward process is unknown but we have multiple examples for each step?
    I hope the title at least kinda makes sense? More details below. The way I understand denoising diffusion models at a high level is that we simulate the forward process with a known distribution like Gaussian noise and then learn the reverse process. I am wondering if it is reasonable to try training a diffusion model when we have a dataset from which we can sample any step of the forward process, but the actual forward process itself is unknown. Basically, we have a training target for each "latent" in the diffusion process. For the sake of simplicity we can think of our dataset as "images", and we can sample any image from the forward process. E.g., we can sample the lowest-quality images, which are just random noise; medium-quality images, which contain the information we are interested in but obscured by noise or missing signal; or the highest-quality images, which are the final target of the model. We want a model which can map low/medium-quality data to denoised high-quality data; basically we are doing a super-resolution or image-denoising task. We can generate any level of lower-quality data we want, but the actual pixel-level difference between steps in the forward process is much more complex than just additive Gaussian noise. I already know it is *technically* feasible to train a diffusion model on this dataset, since I am already testing it with varying levels of success. What I am asking is more whether this makes any theoretical sense, and where it might fit in with the current literature on these models. Is it sufficient to just train an image-to-image model like a Unet to map each sample to the next in the Markov chain, since it might not be possible to evaluate the KL divergence wrt the forward process? Or is there some way to approximate this as well? How is this any different from just training a stacked denoising autoencoder? Thanks for any insight and/or discussion! submitted by /u/dimsycamore [link] [comments]  ( 2 min )
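    One plain reading of the setup, as a hedged sketch (assuming a hypothetical `sample_at(batch, t)` hook that draws the dataset's step-t version of each example), is the per-step regression the post mentions:

    ```python
    import torch

    def train_step(model, sample_at, optimizer, batch, T):
        # Pick a random step t, fetch a noisy draw at step t and a cleaner
        # draw at step t-1 from the dataset itself (the forward process is
        # unknown), and regress model(x_t, t) onto x_{t-1}.
        t = torch.randint(1, T, (1,)).item()
        x_t, x_prev = sample_at(batch, t), sample_at(batch, t - 1)
        loss = torch.mean((model(x_t, t) - x_prev) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```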
    [P]Made a community for freelancing in ML field
    Currently, freelancing in AI is still a niche and has a cold-start problem, whether you are freelancing yourself or on a platform such as Upwork, Fiverr, etc. Join the community and let's explore this alternative career path. https://www.reddit.com/r/ai_freelancing/ submitted by /u/meame2010 [link] [comments]  ( 1 min )
    [D] How does the accuracy of large language models compare to average humans? How does GPT-NeoX compare?
    I saw a graph comparing GPT-3, UnifiedQA and Gopher to Human Expert in accuracy across various fields. https://piped.kavin.rocks/watch?v=aPiHhJjN3hI&t=3m55s I was curious how GPT-NeoX compares, and how average humans perform. submitted by /u/ConsistentSense4760 [link] [comments]  ( 1 min )
    [P] Official Imagen Website by Google Brain
    https://imagen.research.google/ submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [Discussion] Is missing data still a problem?
    I'm curious to hear people's perspectives on this. As I understand it, the easiest and one of the most popular ways of "handling" missing data is to just throw away incomplete observations. Would spending time developing methods that can learn from incomplete observations still be useful? Or would you argue that, because there are so many settings in which data is so plentiful, it's perfectly fine at this point to just throw away incomplete observations? submitted by /u/vandelay_inds [link] [comments]  ( 1 min )
    [D] Visualizing loss surface in input space
    I am trying to recreate the surfaces from this paper "Interpreting Adversarial Robustness: A View from Decision Surface in Input Space" https://arxiv.org/abs/1810.00144 They briefly mention the reasoning for using input space instead of parameter space in section 2.2 but don't describe how they produce such surfaces in detail. I understand the conventional approach (http://arxiv.org/abs/1712.09913) of projecting into a 2D hyperplane with 2 random (normalized) vectors in parameter space, but struggle to think of a way to do this with input images (e.g. CIFAR10 dataset). I have also failed to find any implementations of this approach. Any advice? submitted by /u/sarfins [link] [comments]  ( 1 min )
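    The parameter-space recipe transfers almost directly to input space; a minimal sketch (assuming `loss_fn` maps a single input image to a scalar loss):

    ```python
    import numpy as np

    def input_space_loss_grid(loss_fn, x0, steps=25, scale=1.0, seed=0):
        # Two random normalized directions with the image's own shape span
        # a 2-D slice of input space; evaluate loss(x0 + a*u + b*v) on a
        # grid of coefficients (a, b) and plot the result as a surface.
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(x0.shape); u /= np.linalg.norm(u)
        v = rng.standard_normal(x0.shape); v /= np.linalg.norm(v)
        ab = np.linspace(-scale, scale, steps)
        return np.array([[loss_fn(x0 + a * u + b * v) for b in ab] for a in ab])
    ```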
    [P] Introducing BlindAI, an Open-source, fast and privacy-friendly AI deployment solution. Benefit from state-of-the-art AI without ever revealing your data!
    Hello everyone, we are pleased to introduce BlindAI to the AI community. BlindAI is an AI deployment solution, leveraging secure enclaves, to make remotely hosted AI models privacy-friendly. Please have a look at our GitHub (https://github.com/mithril-security/blindai) to find out more! Motivation: Today, most AI tools offer no privacy-by-design mechanisms, so when data is sent to be analysed by third parties, it is exposed to malicious usage or potential leakage. We illustrate this below with the use of AI for voice assistants. Audio recordings are often sent to the Cloud to be analysed, leaving conversations exposed to leaks and uncontrolled usage without users' knowledge or consent. By using BlindAI, data remains protected at all times, as it is only decrypted inside a Trusted Execution Environment, called an enclave, whose contents are protected by hardware. While data is in the clear inside the enclave, it is inaccessible to the outside thanks to isolation and memory encryption. This way, data can be processed, enriched, and analysed by AI without exposing it to external parties. We have been able to run several state-of-the-art models with privacy guarantees, enabling us to tackle complex scenarios, from a privacy-friendly voice assistant with Wav2vec2, to confidential chest X-ray analysis with ResNet, through document analysis with BERT. All of these models have been tested and can run with end-to-end protection in under a second on an Intel(R) Xeon(R) Platinum 8370C.
    Model name | Example use case     | Inference time (ms)
    DistilBERT | Sentiment analysis   | 28.435
    Wav2vec2   | Speech to text       | 617.04
    Facenet    | Facial recognition   | 47.135
    A more detailed list of models we can deploy with privacy, with their run times, can be found here. If you like it, drop a ⭐ on our GitHub (https://github.com/mithril-security/blindai)! submitted by /u/Separate-Still3770 [link] [comments]  ( 1 min )
    [D] Best GP book
    Hi! As part of my research I'm using Gaussian process (regression) but want to understand GPs more fully. Does anyone have any book recommendations that explain them well, either whole books on them or parts of wider machine learning textbooks? Thanks submitted by /u/Ikeyt [link] [comments]  ( 1 min )
    [D] Deep Sets and Attention, what's the difference?
    I have stumbled across the relevant literature on neural networks for sets, e.g., Deep Sets, PointerNet, and subsequent works. These architectures look similar to attention mechanisms and related models (e.g., Transformers), but they are treated as a very separate thing in the literature (e.g., in the Deep Sets paper the word "attention" never appears). So, I wonder, what is the main conceptual difference between Deep Sets and attention? I'm not very experienced with attention mechanisms, so I might be missing something in the big picture here. submitted by /u/fedetask [link] [comments]  ( 1 min )
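    A toy sketch of the contrast: Deep Sets pools element encodings with a fixed, input-independent sum, while attention pools with weights computed from the elements themselves; both are permutation-invariant.

    ```python
    import numpy as np

    def deep_sets(X, phi, rho):
        # Deep Sets: encode each element with phi, sum-pool (invariant to
        # element order by construction), then decode with rho.
        return rho(np.sum([phi(x) for x in X], axis=0))

    def attention_pool(X, score, phi, rho):
        # Attention-style pooling: still permutation-invariant, but the
        # pooling weights depend on the inputs (softmax over scores) --
        # this input-dependence is the key conceptual difference.
        s = np.array([score(x) for x in X])
        w = np.exp(s - s.max()); w /= w.sum()
        return rho(np.sum([wi * phi(x) for wi, x in zip(w, X)], axis=0))
    ```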
    [D] Recent research and methods for time series forecasting
    Recent advances in Vision and NLP are dominating the AI community at the moment. I have been trying to find out if something exciting has been done for time series forecasting recently (last five years or so). Looking for some good starting points to keep up with the latest research. Any pointers would be highly appreciated. submitted by /u/ndalal01 [link] [comments]  ( 1 min )
    [D] What are some of the better one shot learning techniques for time series data?
    I am working on a project involving one-shot learning for time series data from dynamical systems. Which ML models should I look at for this purpose? submitted by /u/_hereforthecomments [link] [comments]  ( 1 min )
    [P] How to quantify the similarity/dissimilarity between two time series datasets?
    Hey guys, I was working with time series datasets for dynamical systems. I need to quantify how similar the datasets from two different dynamical systems are. So what would be an appropriate metric for doing so? submitted by /u/_hereforthecomments [link] [comments]  ( 1 min )
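    One common option among several is dynamic time warping; a minimal sketch for 1-D series (alternatives include cross-correlation, spectral distances, or distances between fitted model parameters):

    ```python
    import numpy as np

    def dtw_distance(a, b):
        # Classic O(nm) dynamic-time-warping distance between two series:
        # D[i, j] is the cheapest alignment cost of a[:i] and b[:j].
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]
    ```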
    [D] finding the paper I want in the CVPR 2022.
    Hello guys, when I was searching for papers on object detection in the conference, I was surprised to find that there were over 2,000 papers. I usually screen papers by their abstracts, but this time do I just have to open each paper to check whether my topics of interest are there? Please let me know. If I have to do that, I will. Thx. :) submitted by /u/Mundane_Definition_8 [link] [comments]  ( 1 min )
    [P] What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
    TL;DR We made autoregressive transformer-based models like T5-large 2X faster than 🤗 Hugging Face Pytorch with 3 simple tricks: storing 2 computation graphs in a single Onnx file 👯: this lets us have both cache and no-cache support without duplicating any weights. When the cache is used, attention switches from quadratic to linear complexity (less GPU computation) and Onnx Runtime brings us kernel fusion (fewer memory-bound ops); zero copy 💥 to retrieve output from Onnx Runtime: we leverage the Cupy API to access Onnx Runtime's internal CUDA arrays and expose them through Dlpack to Pytorch. It may sound a bit complex, but it lets us avoid output tensor copies, which limits our memory footprint and makes us much faster (check the notebook for other benefits of this approach); a generic tool to conv…  ( 4 min )
    [D] Calculating Shannon Information of Data Augmentation Strategies
    I recently caught Andrew Ng's 2021 talk on MLOps (MLOps: From Model-centric to Data-centric AI). At 26:40, he talks about calculating the effectiveness of cleaning your data (training examples) vs. collecting new examples. Apparently this is possible to calculate using Shannon Information. As an example, Ng notes that "[if you have] 500 pictures of iguanas and if 12% of the labels are noisy (incorrectly labelled), then from a Shannon Information calculation, you can show that [cleaning up the noisy data and collecting another 500 new examples] are about equally effective.". Does anyone know how exactly this might be calculated? I've tried to find more information on this but without much luck. Intuitively, it seems related to Shannon's Limit but the exact application is unclear to me. Thanks. Edit: Cross-posted to Cross Validated submitted by /u/Academy- [link] [comments]  ( 3 min )
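    A crude back-of-the-envelope calculation consistent with that quote, assuming symmetric noise on an effectively binary label (Ng's exact calculation may differ):

    ```python
    import numpy as np

    def clean_label_equivalents(n_labels, p_noise):
        # Bits retained per label under symmetric label noise: 1 - H(p),
        # where H is the binary entropy in bits. 500 labels at 12% noise
        # keep roughly 500 * (1 - H(0.12)) ~ 235 clean-label equivalents,
        # so cleaning vs. collecting ~500 more examples is comparable.
        p = p_noise
        H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
        return n_labels * (1.0 - H)

    print(clean_label_equivalents(500, 0.12))  # ~235.3
    ```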
    [D] What are the environmental effects of using a pre-trained ML model powered by a GPU?
    The Internet has a lot of information concerning the environmental effects of training an ML model with GPUs. But less is available in terms of the emissions generated by running a trained model on a GPU. For example, Google has a language model called BERT. If I was to deploy it on a web server powered by a GPU, would the emissions be close to nil or would it actually harm the environment? submitted by /u/Alternative-Pause-14 [link] [comments]  ( 1 min )
    [D] ANN architecture for decoupling Signals
    I have read multiple papers and they never attach their source code, like here: https://www.science.org/doi/10.1126/scirobotics.abc6878 My problem is that I have no idea how to apply ANNs for decoupling different measured signals. I don't think I am really capable of digesting what these papers did; I would appreciate it if someone could help me understand how to implement these algorithms from scratch. Thanks submitted by /u/meldiwin [link] [comments]  ( 1 min )
  • Open

    DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner?
    It is hard, looking at the current technological landscape, to believe that artificial intelligence may actually be facing a reckoning that could cause investment in the field to dry up. If you go by the press releases and even the many products that supposedly incorporate artificial-intelligence-oriented features, it would seem that computers capable of thinking… Read More »DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner? The post DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner? appeared first on Data Science Central.  ( 6 min )
    Top Ways in Which Data Science Improves E-Commerce Sales
    Data science refers to the use of algorithms, systems, technology, etc. to glean insights from data of all kinds. Machine learning (ML) and artificial intelligence (AI) make it possible to provide shoppers with predictions based on what they like, even before they decide to look for a specific product offering. Data science has been… Read More »Top Ways in Which Data Science Improves E-Commerce Sales The post Top Ways in Which Data Science Improves E-Commerce Sales appeared first on Data Science Central.  ( 3 min )
    Why is the Gig Economy a New Future?
    The expression “Gig Economy” refers to a free-market system in which project work, fixed-term contracts, and temporary positions are prevalent. Organizations hire independent freelancers or experts for short-term engagements. The term “gig” is slang for a job that lasts a specified period, first popularized by musicians or… Read More »Why is the Gig Economy a New Future? The post Why is the Gig Economy a New Future? appeared first on Data Science Central.  ( 4 min )
    Lessons Learned from Writing My First Python Script
    After 25 Years of Coding in C And Perl. As an independent author/researcher, there is of course nothing in my “job description” that says I should code in Python (or any other language). Yet for a long time, I thought coding in Python would help me a lot. It would mean more readers, and thus… Read More »Lessons Learned from Writing My First Python Script The post Lessons Learned from Writing My First Python Script appeared first on Data Science Central.  ( 6 min )
    Russian Troll Detection By Their Tweets
    This project for me was personal. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. “Everybody always lies” had been the norm. I came to the… Read More »Russian Troll Detection By Their Tweets The post Russian Troll Detection By Their Tweets appeared first on Data Science Central.  ( 4 min )
    Importance of Fearlessness to Exploit the Potential of AI
    “Courage is not the absence of fear, but rather the judgment that something else is more important than fear.  The brave may not live forever, but the cautious do not live at all”– Philippe Renaldi, The Princess Diaries (2001) Optimizing versus innovating…the simple difference between surviving versus thriving. Yes, it is always easier to apply… Read More »Importance of Fearlessness to Exploit the Potential of AI The post Importance of Fearlessness to Exploit the Potential of AI appeared first on Data Science Central.  ( 7 min )
  • Open

    What is the current SOTA for Online RL?
    Hi everyone! Been reading a lot about RL and was wondering what the current SOTA models are for online RL. I have noticed that more and more models are starting to use transformers. Is this also where SOTA online RL models are going? Any ideas / opinions about the future direction of this field would also be highly appreciated. Cheers submitted by /u/Idonai [link] [comments]  ( 1 min )
    Do you know any environments in which both the direction and the position of the agent are tracked?
    In the simple spread env, only position is tracked (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/scenarios/simple_spread.py#L40) submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    How to make game based RL research paper worthy?
    I am completely new to RL; I am experienced with NNs. I decided to do my thesis on RL to take it as a learning opportunity, but I'm beginning to find it quite intimidating to create a sim environment and apply RL, given that I'm inexperienced with RL and have little time for the thesis. Alternatively, games have been suggested to me, which are also fun to work on and learn from, but I'm struggling to come up with a thesis statement that I could also use for publication. Some games I have considered are Go, Bomberland (a Bomberman knock-off from Coder One), and Kore (https://www.kaggle.com/competitions/kore-2022). I am also open to other suggestions. submitted by /u/npc1111 [link] [comments]  ( 1 min )
    Exploration strategy for aerial robotics.
    In continuous-action tasks where even slight exploration can impact the environment drastically (as with aerial robots), what exploration strategy can be used to safely explore the action space? I am currently using the DDPG algorithm for position tracking in drones, with Gaussian noise for exploration. The rewards for this tracking task are sparse, as large negative rewards can make training unstable. So what can I use as the exploration strategy in this scenario? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    Is DQN capable of 'solving' random dungeon traversal of unknown length and start/end positions?
    I'm interested in implementing DQN for a dungeon crawler I play. You are given a 2D map with your position as the central point, and you need to traverse to the next zone; the map is limited in scope and is slowly revealed as you move along. There is a map-based marker for the entrance to the next zone. Since it is a dungeon of random size and random start/end positions, with no way to generate a reward until the agent gets to the next zone (i.e., the max overall reward is 1), is it possible for the agent to learn a policy in this scenario? submitted by /u/IFartedAndMyDickHurt [link] [comments]  ( 2 min )
    More threads, training crashes
    On one computer, using PPO on Swimmer: with 16 threads, training crashes after 1 hour; with fewer than 16 threads, the agent trains smoothly and well. https://preview.redd.it/fivxnkg11c191.png?width=727&format=png&auto=webp&s=4047c1b516d514619f0b73e38efc38d04bab5135 submitted by /u/OkSkirt5714 [link] [comments]  ( 1 min )
  • Open

    What is Real in a World of Social Constructs
    In today’s episode of Future Tech. I ask the question, what is real? Because in this world of social constructs, everything is created by… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
    Drupal Commerce: Why is it Ideal for Your E-Commerce Business
    The world of e-commerce is undeniably among the most lucrative sectors in the world at the moment and is expected to remain so for the…  ( 2 min )
  • Open

    The hype around DeepMind’s new AI model misses what’s actually cool about it
    submitted by /u/KelliaMcclure [link] [comments]  ( 1 min )
    In this tutorial, we show how to automate job search using a fine-tuned NER model. Check out the article to learn more!
    submitted by /u/UBIAI [link] [comments]
    Questionnaire for my dissertation on the subject of A.I
    Hey guys! I'm a student and I'm currently working on my dissertation for University. I'm using this as a way of collecting data on the representation of AI in movies and pop culture and I'd appreciate the responses! Here's the link: https://forms.gle/1jrzrfuSd3rFD6A17 submitted by /u/ZakUllah [link] [comments]
    Working on cognitive control in an artificial cognitive entity - when a machine can think about anything, how do you control what it thinks about and why?
    submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    Saiyan transformation
    submitted by /u/Due-Ad9795 [link] [comments]
    MELODIES POSITIVE: Flowers from coconut cap.
    submitted by /u/cookingandcraft [link] [comments]
    6 Best Artificial Intelligence Courses for Healthcare You Should Learn in 2022
    submitted by /u/maneesh123456 [link] [comments]
    Unsupervised Tokenization Learning
    submitted by /u/akolonin [link] [comments]  ( 1 min )
    I can't wait for Dall E 2, but this will do for now!
    submitted by /u/BeginningRealistic49 [link] [comments]
    Google Brain's new model Imagen is incredible!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Fast.ai vs statistics.com?
    I’m currently doing a math and cs bachelors. I want to move into AI research (Something that is applicable currently but also contributes to the AGI field ideally) in the near future. Which site/resources should I use to learn relevant knowledge? Are there better ones? submitted by /u/BackgroundSense351 [link] [comments]  ( 1 min )
  • Open

    Image-Text Pre-training with Contrastive Captioners
    Posted by Zirui Wang and Jiahui Yu, Research Scientists, Google Research, Brain Team Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large s…  ( 7 min )
  • Open

    Powering Next Generation Applications with OpenAI Codex
    Codex is now powering 70 different applications across a variety of use cases through the OpenAI API.  ( 3 min )
  • Open

    Is there any open public API like thispersondoesnotexist.com where I can specify gender and age?
    submitted by /u/AxelTheRabbit [link] [comments]
    How to fix cardinality error in my CNN model
    I am currently working on a CNN model where my inputs are CSVs with 1080 rows and 4 attributes (columns). I have all the CSVs under their own category directories, for example /Category A/..., /Category B/..., etc. At the start I created two arrays: X = [] and Y = [], and then, in a for loop going through all the directories, I read the contents of each CSV in a `1080x4` shape and put it in my X array like this (I have confirmed the values are read correctly and in the correct shape): X.append(pandas.read_csv(itemPath).values) and I add the category of the item to my Y array right after, so the order of `X` items is aligned with the order of `Y` items (categories of X values): Y.append(cat) Then here is my model, albeit I got it mostly from a Kaggle example, I just wanted to see if it works …  ( 4 min )
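    A common cause of Keras' "data cardinality is ambiguous" error in a setup like this is that X and Y end up with mismatched lengths or ragged shapes, so they can't be stacked into one rectangular tensor. A minimal sketch of the loading step, assuming only the directory layout and the `1080x4` shape described above (the "data" root path and the skip-on-mismatch check are illustrative, not from the post):
        import numpy as np
        import pandas as pd
        from pathlib import Path

        X, Y = [], []
        for cat_dir in Path("data").iterdir():           # e.g. data/Category A/, data/Category B/
            for item in cat_dir.glob("*.csv"):
                frame = pd.read_csv(item).values         # expected shape: (1080, 4)
                if frame.shape != (1080, 4):             # skip malformed files so X stays rectangular
                    continue
                X.append(frame)
                Y.append(cat_dir.name)

        X = np.asarray(X, dtype="float32")               # (n_samples, 1080, 4)
        classes, Y = np.unique(Y, return_inverse=True)   # integer-encode the categories
        assert len(X) == len(Y)                          # a mismatch here triggers the cardinality error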
  • Open

    NVIDIA Brings Data Center, Robotics, Gaming, Content Creation Innovations to COMPUTEX
    Digital twins that revolutionize the way the most complex products are produced. Silicon and software that transform data centers into AI factories. Gaming advances that bring the world’s most popular games to life. Taiwan has become the engine that brings the latest innovations to the world. So it only makes sense that NVIDIA leaders brought Read article > The post NVIDIA Brings Data Center, Robotics, Gaming, Content Creation Innovations to COMPUTEX appeared first on NVIDIA Blog.  ( 7 min )
    NVIDIA Adds Liquid-Cooled GPUs for Sustainable, Efficient Computing
    In the worldwide effort to halt climate change, Zac Smith is part of a growing movement to build data centers that deliver both high performance and energy efficiency. He’s head of edge infrastructure at Equinix, a global service provider that manages more than 240 data centers and is committed to becoming the first in its Read article > The post NVIDIA Adds Liquid-Cooled GPUs for Sustainable, Efficient Computing appeared first on NVIDIA Blog.  ( 4 min )
    NVIDIA Partners Announce Wave of New Jetson AGX Orin Servers and Appliances at COMPUTEX
    More than 30 leading technology partners worldwide announced this week the first wave of NVIDIA Jetson AGX Orin-powered production systems at COMPUTEX in Taipei. New products are coming from a dozen Taiwan-based camera, sensor and hardware providers for use in edge AI, AIoT, robotics and embedded applications. Available worldwide since GTC in March, the NVIDIA Read article > The post NVIDIA Partners Announce Wave of New Jetson AGX Orin Servers and Appliances at COMPUTEX appeared first on NVIDIA Blog.  ( 3 min )
    Master of Arts: NVIDIA RTX GPUs Accelerate Creative Ecosystems, Delivering Unmatched AI and Ray-Tracing Performance
    The future of content creation was on full display during the virtual NVIDIA keynote at COMPUTEX 2022, as the NVIDIA Studio platform expands with new Studio laptops and RTX-powered AI apps — all backed by the May Studio Driver released today. The post Master of Arts: NVIDIA RTX GPUs Accelerate Creative Ecosystems, Delivering Unmatched AI and Ray-Tracing Performance appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    President Guðni Thorlacius Jóhannesson of Iceland visits MIT
    Delegation meets campus leaders, with an eye toward AI applications and the Icelandic language.  ( 6 min )

  • Open

    Our own stupidity might be an advantage for AI
    We are just smart enough to deal with problems logically, but too stupid to understand certain information hazards. Our "just good enough" understanding could be used where AI can't go. submitted by /u/burner557799 [link] [comments]
    What are the operating systems with the most AI technology built in right now?
    submitted by /u/Scared_Assistance_28 [link] [comments]
    New Qualcomm RB6 AI Robot Cloud Accelerator Has Quad-Core 2.85 GHz @ 200 Trillion Operations Per Second | Breakthrough Neural Network TPU Architecture
    submitted by /u/SlightSituation [link] [comments]  ( 1 min )
    When AI Meets the Transgender Community
    submitted by /u/punkthesystem [link] [comments]  ( 1 min )
    Meta's MyoSuite — An embodied AI platform that unifies neural and motor intelligence
    submitted by /u/SpatialComputing [link] [comments]  ( 1 min )
    Stanford And Oxford Researchers Propose An Approach To Relate Transformers To Models And Neural Representations Of The Hippocampal Formation
    In recent years, a significant part of neuroscience research has focused on relating deep learning architectures to the human brain, and many deep learning (DL) techniques have recently been shown to replicate neural firing patterns observed in the brain. For example, representations of convolutional neural networks have been shown to predict neurons in the visual cortex and inferior temporal cortex, while recurrent neural networks have been shown to recapitulate grid cells in the medial entorhinal cortex. The ability to use machine learning models to predict brain representations allows for a deeper understanding of the mechanistic computations of the respective brain areas and a deeper understanding of the nature of the models. However, one of the most exciting and promising new architec…  ( 2 min )
    Spring (A. I. animation + sound design)
    submitted by /u/nenomancer [link] [comments]  ( 1 min )
    Frame interpolation that allows for multiple sources?
    I'm working with material that was originally mastered in 29.97fps, then converted to 25fps for PAL territories. The PAL source is the best looking copy, in terms of detail, fewer digital artifacts, resolution, and colour - but it's obviously missing frames. Rather than just using AI to create new frames, is there a tool that will allow me to introduce multiple sources so the AI knows what those missing frames should best look like? submitted by /u/Plebsolute [link] [comments]  ( 1 min )
    Interesting AI
    Hello, I plan on giving a TEDx talk on AI because I love AI and I have spent a lot of time learning AI, Machine Learning, and Deep Learning (with all the math). I thought of talking about interesting AI systems that currently exist, but there are many, so I can't decide. I would love to hear your ideas on the most interesting AI you have seen that I could talk about. All responses are appreciated. Thanks submitted by /u/No-Conversation8169 [link] [comments]  ( 1 min )
    Superhuman AI - Artificial intelligence predicts patients’ race from their medical images
    submitted by /u/qptbook [link] [comments]
    Machine learning has a backdoor problem
    submitted by /u/bendee983 [link] [comments]
    The StatQuest Illustrated Guide To Machine Learning eBook
    The StatQuest Illustrated Guide To Machine Learning submitted by /u/Futureisnotsecure [link] [comments]
    In The Latest AI Research, CMU And Adobe Researchers Propose An Elegant Ensembling Mechanism For GAN Training That Improves FID by 1.5x to 2x On The Given Dataset
    Image generation necessitates the ability to capture and model complicated statistics in real-world visual events. When trained on large-scale data, computer vision models have proven adept at capturing valuable representations, thanks to the effectiveness of supervised and self-supervised learning techniques. Surprisingly, despite the aforementioned link between synthesis and analysis, state-of-the-art generative adversarial networks (GANs) are trained without the use of such pre-trained networks, in an unsupervised way. This is a squandered opportunity, given the abundance of relevant models readily available in the research ecosystem. Adobe researchers investigated the use of a collection of pretrained deep feature extractors to aid generative model training in a recent publication. GANs are trained with a discriminator and a generator, both aimed at continuously learning the statistics that distinguish genuine from generated data. Continue Reading Paper: https://arxiv.org/pdf/2112.09130.pdf Github: https://github.com/nupurkmr9/vision-aided-gan Project: https://www.cs.cmu.edu/~vision-aided-gan/ https://i.redd.it/y33lpsbpg6191.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Flower pot.
    submitted by /u/cookingandcraft [link] [comments]
    15 Machine Learning Project (End to End)
    Hi Guys, free tutorials on Machine Learning Projects (End to End) in Apache Spark and Scala, with code and explanations:
    - Machine Learning Pipeline Application on Power Plant
    - Build a Movies Recommendation Engine
    - Sales Prediction or Sales Forecast
    - Mushroom Classification: whether it’s edible or poisonous
    - Predict Forest Cover
    - Predict Will It Rain Tomorrow in Australia
    - Customer Segmentation using Machine Learning
    - Predict Ads Click (93% Accuracy)
    - Predict whether a person makes over 50K a year
    - Classifying gender based on personal preferences
    - Mobile Price Classification
    - Predicting the Cellular Localization Sites of Proteins in Yeast
    - YouTube Spam Comment Prediction
    - Identify the Type of Animal (7 Types) based on the available attributes
    - Glass Identification
    - Predicting the age of abalone from physical measurements
    I hope you'll enjoy these tutorials. submitted by /u/bigdataengineer4life [link] [comments]  ( 1 min )
    You are water-based, not silicon-based, life. Be like water.
    The late Terence McKenna, in one of his many talks, pointed out the two choices on the menu when it comes to why we are here. One is the Christian belief that God created the world in six days. The other is that a Big Bang happened some billions of years ago and the universe sprang out of nothingness as a random happening. His point was this: both of these explanations are utterly improbable. Still, if you hold the Christian belief, then you do not question it and hold it as "the truth". Likewise, if you believe in the Big Bang explanation, you hold that as truth. From other cultures we learn that the world actually resides on the back of a giant turtle. Sounds crazy, right? Except if you are brought up with that belief. Could the important thing be WHO YOU BECOME by believing a given…  ( 2 min )
  • Open

    [P] Imagen: Latest text-to-image generation model from Google Brain!
    Imagen - unprecedented photorealism × deep level of language understanding Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Human raters prefer Imagen over other models (such as DALL-E 2) in side-by-side comparisons, both in terms of sample quality and image-text alignment. https://gweb-research-imagen.appspot.com/ https://gweb-research-imagen.appspot.com/paper.pdf submitted by /u/aifordummies [link] [comments]  ( 1 min )
    [D] ECCV 2022 Reviews
    Now ECCV 2022 reviews are out, what is your general feeling about the quality of reviews? submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [Discussion] Natural Language as an intermediate link for solving non-NLP machine learning problems
    Hello. Recently I stumbled upon the following RL paper: https://arxiv.org/abs/2202.08938 , and it made me super curious. In their experiments, an agent acts in the setup of a simple video game, and textual descriptions of the environment are used as additional information for this agent, to help it perform novelty search more efficiently, because textual descriptions express the real novelty of a new environment more meaningfully than raw data. As they write: "we explore natural language as a general medium for highlighting relevant abstractions in an environment". After reading it, I became curious whether there are other papers where natural language descriptions are used as an intermediate tool to help a machine learning model perform some non-NLP task. Maybe someone has even tried to teach an NLP model to generate the most useful hints for a non-NLP model solving non-NLP tasks? I would be grateful to learn any info about such a research direction. Thanks! submitted by /u/KushnarevaL [link] [comments]  ( 1 min )
    [D] I want to compare training a CNN on dataset A vs. pretraining on another, bigger dataset B and then finetuning on dataset A. Should I use the same learning rate for both experiments, or a reduced learning rate for finetuning?
    I have read that it is commonplace to use a reduced learning rate when finetuning on a new dataset for transfer learning. On the other hand, it seems to me bad practice to do that and then compare against the simple training run that used a larger learning rate. submitted by /u/TheManveru [link] [comments]  ( 1 min )
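    One common answer is to treat the learning rate as a hyperparameter tuned separately for each condition, so neither setup is handicapped by a rate chosen for the other. A minimal sketch of that protocol, where train() and evaluate() are hypothetical stand-ins for the poster's pipeline:
        # tune the learning rate per condition instead of sharing one value;
        # train() and evaluate() are hypothetical stand-ins for the actual pipeline
        for setting in ["scratch_on_A", "pretrain_on_B_then_finetune_on_A"]:
            results = [(evaluate(train(setting, lr=lr)), lr)
                       for lr in [3e-2, 1e-2, 3e-3, 1e-3, 3e-4]]
            best_acc, best_lr = max(results)
            print(f"{setting}: best val acc {best_acc:.3f} at lr {best_lr}")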
    [D] Scaling sentence embeddings for PCA
    I’m doing PCA on different sentence embeddings (word2vec, BERTweet, InferSent…) of my data. My question is: should I scale these embeddings before putting them into PCA? I know it’s standard practice in ML when using PCA, but I don't know if it still holds for sentence embeddings. Also, if I should, will StandardScaler be OK? submitted by /u/No_Technology1455 [link] [comments]  ( 1 min )
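    For reference, a minimal sketch of the standard recipe with scikit-learn (the random array stands in for the actual sentence embeddings). Note that PCA only strictly requires centering, which sklearn's PCA does internally; whether to additionally scale each embedding dimension to unit variance is exactly the judgment call being asked about:
        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA

        emb = np.random.randn(1000, 768)               # stand-in for sentence embeddings
        scaled = StandardScaler().fit_transform(emb)   # zero mean, unit variance per dimension
        pca = PCA(n_components=50).fit(scaled)
        reduced = pca.transform(scaled)
        print(pca.explained_variance_ratio_.sum())     # variance retained by 50 components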
    [D] What's the intuition behind certain CNN architectures?
    My knowledge about CNNs is relatively basic. I know what they do, but I don't have enough experience to understand the different choices in architectures (besides the obvious improvements, like a residual block). For what I do, I never needed to rely on CNNs, so whenever I worked with them I just used some existing CNN structures, but now I would like to learn more, so that I can optimize my current models. So, for example, looking at some CNNs used in reinforcement learning, here's one in particular (the CNN from the Atari self-learning Nature article):
        self.base = nn.Sequential(
            init_(nn.Conv2d(4, 32, kernel_size=8, stride=4, padding=0)), nn.ReLU(),
            init_(nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0)), nn.ReLU(),
            init_(nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=0)), nn.ReLU(),
            Flatten(),
            init_(nn.Linear(32 * 7 * 7, outputs)), nn.ReLU()
        )
    The input is a stack of 4 frames with 84x84 grayscale pixels. I understand that you'd want to divide this image into many smaller fields, but what's the intuition for doing it this way, e.g. 3 conv layers? Why not 5? Why not 2 or 1, but with more outputs? In fact, it seems to me that a parallel approach with different parameters, instead of a sequential approach, would be superior, since information seems to get lost after the first layer, which uses a kernel size of 8. Are there any resources you could recommend that delve deeply into the nuances of creating CNN architectures? Thank you submitted by /u/NikEy [link] [comments]  ( 7 min )
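    One concrete way to build intuition for a stack like this is to trace the spatial sizes with the standard formula out = (in + 2*padding - kernel) // stride + 1, which also explains the 32 * 7 * 7 input to the Linear layer. A quick sketch using only the shapes quoted above:
        def conv_out(size, kernel, stride, padding=0):
            return (size + 2 * padding - kernel) // stride + 1

        size = 84
        for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
            size = conv_out(size, kernel, stride)
            print(size)  # 20, then 9, then 7 -> 32 channels * 7 * 7 features into the Linear layer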
    AI book reading platform using machine learning [P]
    Greetings Folks, last year we released a book-reading AI tool to search for content within files using natural search, and we received constructive feedback from the machine learning community. We are now releasing a new, updated version with a fresh UI overhaul (desktop support): https://rastero.io/books/explore This project utilizes the sentence transformers library from UKP Labs found here. Glad to share it with you all. Here's an account to try it out! username: reddit password: reddit2022 A demo of the interface to search using semantic-similarity-based search. submitted by /u/deep_ak [link] [comments]  ( 1 min )
    [Discussion] What are frameworks used for Human-In-The-Loop (Active) learning ?
    We have a new request from our product team: "Develop a human-in-the-loop binary classifier for a fashion application. The classifier should learn the themes of photos and classify future data with a high degree of confidence." The human in the loop will be an annotator or expert whom we can query to label examples (in mini-batches). The number of queries is limited by confidence score and budget. Are there any tools/frameworks for building such (active learning) applications? submitted by /u/UncertainLangur [link] [comments]  ( 1 min )
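    Frameworks exist for this (e.g. modAL on top of scikit-learn), but the core loop is small enough to sketch directly. A minimal pool-based uncertainty-sampling sketch; query_human() is a hypothetical stand-in for the budget-limited annotator:
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def active_learn(X_seed, y_seed, X_pool, query_human, budget=100, batch=8):
            model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
            X_lab, y_lab = list(X_seed), list(y_seed)
            for _ in range(budget // batch):
                confidence = model.predict_proba(X_pool).max(axis=1)
                idx = np.argsort(confidence)[:batch]        # query the least-confident examples
                for i in idx:
                    X_lab.append(X_pool[i])
                    y_lab.append(query_human(X_pool[i]))    # ask the in-the-loop expert for a label
                X_pool = np.delete(X_pool, idx, axis=0)
                model.fit(np.array(X_lab), np.array(y_lab))
            return model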
    [D] Anyone waiting for ECCV reviews?
    They should come out today, but I guess they are delayed. submitted by /u/SeucheAchat9115 [link] [comments]
    [P] CTranslate2: an efficient inference engine for Transformer models
    Hi! I'd like to share this project I've been working on for almost 4 years: https://github.com/OpenNMT/CTranslate2 CTranslate2 is a C++ and Python library for efficient inference with Transformer models. While the project initially focused on translation models (hence the name), it also supports autoregressive language models such as GPT-2 and the recent OPT models from Meta. The library comes with a highly optimized runtime that implements various performance optimization techniques such as weight quantization, layer fusion, batch reordering, padding removal, etc. (Check out these benchmark tables for a comparison with other frameworks on a translation task.) We provide model converters for multiple frameworks: OpenNMT, Fairseq, Marian, and Hugging Face's Transformers. You can also add your own converter if the model architecture is supported. We currently support selected variants of encoder-decoder and decoder-only Transformer models (including pre-norm and post-norm). If you'd like to learn more, please visit the GitHub repository and feel free to post questions or suggestions about the project! Thanks! submitted by /u/guillaumekln [link] [comments]  ( 1 min )
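    For a sense of the Python API, a minimal usage sketch based on the project's README at the time (the model directory comes from one of the converters above, and the subword tokens are model-specific, so both are placeholders):
        import ctranslate2

        translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
        results = translator.translate_batch([["▁Hello", "▁world", "!"]])
        print(results[0].hypotheses[0])   # best hypothesis as a list of target tokens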
    [R] Self-Net: Lifelong Learning Via Continual Self-Modeling
    submitted by /u/EducationalCicada [link] [comments]
    [N] ShiftHappens Workshop @ICML 2022 welcoming submissions & AMA
    I'm one of the organizers of the ShiftHappens Workshop hosted at ICML 2022. The goal of the workshop is to create a community-built robustness benchmark, incorporating new challenging datasets and tasks. Submissions can be ImageNet-scale datasets or evaluation tasks that help find potentially unforeseen behaviours of tested models. All accepted contributions will be part of the benchmark and authors can become co-authors of a summary paper. To make it easy to contribute, we now also accept submissions in the form of extended abstracts, where an interesting idea for a dataset or task is explained in a one-page paper format, see also our latest tweet! We welcome datasets and tasks published at previous conferences as well as novel and work-in-progress papers. So if you e.g. evaluated your training method on a new task highlighting its advantages, we would be happy to receive that task as a submission :) If you don't, share this with friends and colleagues that might ;) Also check out our strong line-up of invited speakers which you might not want to miss if you are at the ICML conference this year! I and some of the other organizers are around to answer any questions about submitting, data shift and the workshop in general. submitted by /u/JBitterwolf [link] [comments]  ( 1 min )
    [R] Nothing makes sense in deep learning, except in the light of evolution
    submitted by /u/DevFRus [link] [comments]  ( 1 min )
    [D] - Is it safe to force stop, save weights and later resume a Tensorflow model during training?
    So I am working with a U-Net (CNN) architecture with an Adam optimizer. Occasionally I have force-stopped the model, saved the weights, then resumed the training later on. When I resume, the loss plot looks less smooth than it did earlier, but the loss also seems to have decreased significantly. Is it generally safe to stop and resume model training in this way? Also, I'm not sure how to interpret the decreased but less smooth loss function. submitted by /u/suoarski [link] [comments]  ( 2 min )
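    A plausible explanation for the less smooth curve: saving only the weights discards Adam's first and second moment estimates, so the optimizer restarts cold when training resumes. Checkpointing the optimizer together with the model avoids this. A minimal sketch (the Dense layer stands in for the actual U-Net):
        import tensorflow as tf

        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # stand-in for the U-Net
        optimizer = tf.keras.optimizers.Adam(1e-4)
        model.compile(optimizer=optimizer, loss="mse")

        # checkpoint both model and optimizer so Adam's moment estimates survive the restart
        ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
        manager = tf.train.CheckpointManager(ckpt, "./ckpts", max_to_keep=3)

        manager.save()                              # call before force-stopping
        ckpt.restore(manager.latest_checkpoint)     # call when resuming training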
    ICML 2022 papers with affiliations [D]
    https://icml.cc/Conferences/2022/AcceptedPapersInitial On a quick Ctrl-F: Tsinghua (157) overtakes Stanford (139). In 2021, these values were Tsinghua (71) and Stanford (140). http://web.archive.org/web/20210618094315/https://icml.cc/Conferences/2021/AcceptedPapersInitial ^ 2021 for comparison submitted by /u/Simping4Kaiming [link] [comments]
    [D] ML library (ideally for R) with a good software design?
    Hi! As a Machine Learning Engineer, I was studying the design patterns behind scikit-learn's API (you can see here and here) and I was wondering if any of you know of something similar but for R that I can check. Note: I am asking about R because that's what I am using and it is difficult to find something functional programming oriented for other languages, but any other library you find interesting is welcome! submitted by /u/Silver_Book_938 [link] [comments]  ( 2 min )
    [D] How to obtain frequencies of English word phrases?
    I know one way to calculate frequencies of English word phrases such as "it's raining today" is to simply search for that phrase in a large text corpus, but would anyone happen to know of any simpler/newer ways to do that? I'm wondering if it's somehow possible to calculate the frequencies of word phrases using a large language model such as GPT-3. submitted by /u/Specialist_Art_8502 [link] [comments]  ( 1 min )
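    A language model gives a probability rather than a corpus frequency, but it can score and rank phrases along the lines the poster suggests. A minimal sketch with GPT-2 via Hugging Face Transformers (since the returned loss is the mean negative log-likelihood over the shifted targets, it is rescaled back to a summed log-probability):
        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tok = GPT2TokenizerFast.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

        def phrase_logprob(phrase: str) -> float:
            ids = tok(phrase, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss    # mean NLL over shifted targets
            return -loss.item() * (ids.shape[1] - 1)  # summed log-probability of the phrase

        print(phrase_logprob("it's raining today"))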
  • Open

    Breakthrough Stashing Algorithm & Neural Network TPU Architecture
    submitted by /u/getrich_or_diemining [link] [comments]
    How I (Kinda) Created an A.I. Generated Cartoon
    submitted by /u/BasicallyJustASpider [link] [comments]  ( 1 min )
    The StatQuest Illustrated Guide To Machine Learning eBook
    submitted by /u/Futureisnotsecure [link] [comments]  ( 1 min )
  • Open

    Voiced and unvoiced consonants and digits
    The latest episode of The History of English Podcast discusses the history of pronunciation changes in the Elizabethan period. The episode has a lot to say about the connections between voiced and unvoiced pairs of consonants, and the circumstances under which a consonant might change from voiced to unvoiced and vice versa. The major mnemonic […] Voiced and unvoiced consonants and digits first appeared on John D. Cook.  ( 2 min )
    Year share
    This post will be about psychology as much as math, looking at a number of algorithms for mentally calculating the same function. The most difficult part of mentally computing days of the week is computing ⌊5y/4⌋ % 7 where y is the last two digits of a year. This quantity is called the year share […] Year share first appeared on John D. Cook.  ( 3 min )
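    For readers who want to play with the quoted formula directly, it is a one-liner (taking y as the last two digits of the year, as in the excerpt):
        def year_share(y):
            return (5 * y // 4) % 7

        print(year_share(22))   # year share for 2022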
  • Open

    What are some recent or timeless interesting papers on approaches to Dynamic Pricing in RL? I'm thinking Uber surcharging or limited-supply, queue-length-dependent pricing policies to maximize revenue.
    submitted by /u/n3ver_summer [link] [comments]  ( 1 min )
    Is it possible to use the MuJoCo Gym environments with the new Python bindings?
    It appears Gym uses the old Python integration "mujoco-py" rather than the new official one (https://mujoco.readthedocs.io/en/latest/python.html). Is it possible to use the Gym environments with the new Python bindings? If not, will Gym be updated to support the new official Python bindings? submitted by /u/arckonte [link] [comments]  ( 1 min )
    7 Best Keras Online Courses for Deep Learning
    submitted by /u/MlTut [link] [comments]
    Does semi-gradient TD(lambda) + experience replay make sense?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    How HRs Can Use Recruitment Data Analytics for Better Hiring
    Nearly every forward-thinking organization uses analytics in recruitment to bring efficiency to its hiring process. A significant…  ( 3 min )
  • Open

    (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools
    It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.   Content moderation tools can be deployed to […] The post (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools appeared first on Microsoft Research.  ( 10 min )
    Partnering people with large language models to find and fix bugs in NLP systems
    Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly […] The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.  ( 10 min )
  • Open

    Energy Grids Plug into AI for a Brighter, Cleaner Future
    Electric utilities are taking a course in machine learning to create smarter grids for tough challenges ahead. The winter 2021 megastorm in Texas left millions without power. Grid failures the past two summers sparked devastating wildfires amid California’s record drought. “Extreme weather events of 2021 highlighted the risks climate change is introducing, and the importance Read article > The post Energy Grids Plug into AI for a Brighter, Cleaner Future appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Why Should the Retail Sector Move to Cloud Computing?
    What is cloud computing? It is one of the foremost technological innovations of the 21st century and has seen faster adoption into mainstream use than any other technology. Simply put, it’s the ability to store and run software programs on cloud platforms rather than on local servers or computers.… Read More »Why Should the Retail Sector Move to Cloud Computing? The post Why Should the Retail Sector Move to Cloud Computing? appeared first on Data Science Central.  ( 4 min )
  • Open

    Deconfounding Actor-Critic Network with Policy Adaptation for Dynamic Treatment Regimes. (arXiv:2205.09852v1 [cs.LG])
    Despite intense efforts in basic and clinical research, an individualized ventilation strategy for critically ill patients remains a major challenge. Recently, dynamic treatment regime (DTR) with reinforcement learning (RL) on electronic health records (EHR) has attracted interest from both the healthcare industry and machine learning research community. However, most learned DTR policies might be biased due to the existence of confounders. Although some treatment actions non-survivors received may be helpful, if confounders cause the mortality, the training of RL models guided by long-term outcomes (e.g., 90-day mortality) would punish those treatment actions causing the learned DTR policies to be suboptimal. In this study, we develop a new deconfounding actor-critic network (DAC) to learn optimal DTR policies for patients. To alleviate confounding issues, we incorporate a patient resampling module and a confounding balance module into our actor-critic framework. To avoid punishing the effective treatment actions non-survivors received, we design a short-term reward to capture patients' immediate health state changes. Combining short-term with long-term rewards could further improve the model performance. Moreover, we introduce a policy adaptation method to successfully transfer the learned model to new-source small-scale datasets. The experimental results on one semi-synthetic and two different real-world datasets show the proposed model outperforms the state-of-the-art models. The proposed model provides individualized treatment decisions for mechanical ventilation that could improve patient outcomes.  ( 2 min )
    Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. (arXiv:2205.08078v2 [cs.LG] UPDATED)
    Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For the non-linear dot-product self-attention, and alternative mechanisms such as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality. The convex programs lead to block nuclear-norm regularization that promotes low rank in the latent feature and token dimensions. In particular, we show how self-attention networks implicitly cluster the tokens, based on their latent similarity. We conduct experiments for transferring a pre-trained transformer backbone for CIFAR-100 classification by fine-tuning a variety of convex attention heads. The results indicate the merits of the bias induced by attention compared with the existing MLP or linear heads.  ( 2 min )
    Let the Model Decide its Curriculum for Multitask Learning. (arXiv:2205.09898v1 [cs.LG])
    Curriculum learning strategies in prior multi-task learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e. dataset-level and instance-level, differ in the granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.  ( 2 min )
    Incremental Learning with Differentiable Architecture and Forgetting Search. (arXiv:2205.09875v1 [cs.LG])
    As progress is made on training machine learning models on incrementally expanding classification tasks (i.e., incremental learning), a next step is to translate this progress to industry expectations. One technique missing from incremental learning is automatic architecture design via Neural Architecture Search (NAS). In this paper, we show that leveraging NAS for incremental learning results in strong performance gains for classification tasks. Specifically, we contribute the following: first, we create a strong baseline approach for incremental learning based on Differentiable Architecture Search (DARTS) and state-of-the-art incremental learning strategies, outperforming many existing strategies trained with similar-sized popular architectures; second, we extend the idea of architecture search to regularize architecture forgetting, boosting performance past our proposed baseline. We evaluate our method on both RF signal and image classification tasks, and demonstrate we can achieve up to a 10% performance increase over state-of-the-art methods. Most importantly, our contribution enables learning from continuous distributions on real-world application data for which the complexity of the data distribution is unknown, or the modality less explored (such as RF signal classification).  ( 2 min )
    Predicting electrode array impedance after one month from cochlear implantation surgery. (arXiv:2205.10021v1 [cs.LG])
    Sensorineural hearing loss can be treated with cochlear implantation. After this surgery, electrode array impedance measurements let us check the stability of the impedance values and the dynamic range. Deterioration of speech recognition scores can happen because of increased impedance values. Clinicians take these measurements many times during the year after surgery. Predicting the electrode impedance could support decisions that help the patient hear better. In this research we used a dataset of 80 pediatric patients who underwent cochlear implantation with the MED-EL FLEX28 electrode array of 12 channels. We predicted the electrode impedance on each channel one month after the surgery date. We used different machine learning algorithms such as neural networks and decision trees. Our results indicate that the electrode impedance can be predicted, and that the best algorithm differs by electrode channel. Also, the accuracy level varies between 66% and 100% depending on the electrode channel, when accepting an error range between 0 and 3 kΩ. Further research is required to predict the electrode impedance after three months, six months and one year.  ( 2 min )
    Neural-Symbolic Models for Logical Queries on Knowledge Graphs. (arXiv:2205.10128v1 [cs.AI])
    Answering complex first-order logic (FOL) queries on knowledge graphs is a fundamental task for multi-hop reasoning. Traditional symbolic methods traverse a complete knowledge graph to extract the answers, which provides good interpretation for each step. Recent neural methods learn geometric embeddings for complex queries. These methods can generalize to incomplete knowledge graphs, but their reasoning process is hard to interpret. In this paper, we propose Graph Neural Network Query Executor (GNN-QE), a neural-symbolic model that enjoys the advantages of both worlds. GNN-QE decomposes a complex FOL query into relation projections and logical operations over fuzzy sets, which provides interpretability for intermediate variables. To reason about the missing links, GNN-QE adapts a graph neural network from knowledge graph completion to execute the relation projections, and models the logical operations with product fuzzy logic. Extensive experiments on 3 datasets show that GNN-QE significantly improves over previous state-of-the-art models in answering FOL queries. Meanwhile, GNN-QE can predict the number of answers without explicit supervision, and provide visualizations for intermediate variables.  ( 2 min )
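    For readers unfamiliar with the "product fuzzy logic" the abstract mentions, the logical operators over fuzzy sets have a compact closed form. A minimal sketch of the product t-norm family (an illustration of the general technique, not GNN-QE's actual code):
        import numpy as np

        # fuzzy sets = one membership score in [0, 1] per entity
        def fuzzy_and(x, y): return x * y                # product t-norm (conjunction)
        def fuzzy_or(x, y):  return x + y - x * y        # probabilistic sum (disjunction)
        def fuzzy_not(x):    return 1.0 - x              # standard negation

        a = np.array([0.9, 0.2, 0.7])   # membership of 3 entities in set A
        b = np.array([0.5, 0.8, 0.1])   # membership of the same entities in set B
        print(fuzzy_and(a, b), fuzzy_or(a, b), fuzzy_not(a))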
    Robust Expected Information Gain for Optimal Bayesian Experimental Design Using Ambiguity Sets. (arXiv:2205.09914v1 [stat.ML])
    The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze robust expected information gain (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a 'log-sum-exp' stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.  ( 2 min )
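    For context, the 'log-sum-exp' stabilization refers to how the inner marginal likelihood is computed when estimating EIG by nested Monte Carlo. A generic NMC sketch (not the paper's REIG implementation; sample_prior, simulate, and log_lik are hypothetical stand-ins for a concrete model):
        import numpy as np
        from scipy.special import logsumexp

        def nmc_eig(design, sample_prior, simulate, log_lik, N=1000, M=200):
            total = 0.0
            for _ in range(N):
                theta = sample_prior()
                y = simulate(theta, design)
                inner = np.array([log_lik(y, sample_prior(), design) for _ in range(M)])
                # log of the Monte Carlo marginal, computed stably via log-sum-exp
                log_marginal = logsumexp(inner) - np.log(M)
                total += log_lik(y, theta, design) - log_marginal
            return total / N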
    When, where, and how to add new neurons to ANNs. (arXiv:2202.08539v2 [cs.LG] UPDATED)
    Neurogenesis in ANNs is an understudied and difficult problem, even compared to other forms of structural learning like pruning. By decomposing it into triggers and initializations, we introduce a framework for studying the various facets of neurogenesis: when, where, and how to add neurons during the learning process. We present the Neural Orthogonality (NORTH*) suite of neurogenesis strategies, combining layer-wise triggers and initializations based on the orthogonality of activations or weights to dynamically grow performant networks that converge to an efficient size. We evaluate our contributions against other recent neurogenesis works across a variety of supervised learning tasks.  ( 2 min )
    Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization. (arXiv:2205.10217v1 [stat.ML])
    The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(N)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.  ( 2 min )
    KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. (arXiv:2205.09921v1 [cs.CL])
    Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets.  ( 2 min )
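    As a rough illustration of a kernelized relative positional bias, here is a sketch in the spirit of the paper's logarithmic variant; the form -r1 * log(1 + r2 * |m - n|) with learnable per-head r1, r2 > 0 is recalled from the paper and should be treated as an assumption:
        import torch

        def kerple_log_bias(seq_len, r1=1.0, r2=1.0):
            # CPD-kernel bias added to attention logits; the constant offset that
            # would make the kernel PD cancels in the softmax normalization
            pos = torch.arange(seq_len)
            dist = (pos[None, :] - pos[:, None]).abs().float()
            return -r1 * torch.log1p(r2 * dist)

        scores = torch.randn(1, 8, 16, 16)        # (batch, heads, query, key) attention logits
        attn = (scores + kerple_log_bias(16)).softmax(dim=-1)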
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v1 [cs.LG])
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime, which is only restricted locally in the so-called neural tangent kernel space around specialized initializations. Several prior works (Mei et al., 2018; Chizat and Bach, 2018) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated nonlinearity of the training dynamics. This work establishes a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel logarithmic Sobolev inequality for two-layer neural networks, and uniform upper bounds on the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.  ( 2 min )
    An Artificial Neural Network Functionalized by Evolution. (arXiv:2205.10118v1 [cs.NE])
    The topology of artificial neural networks has a significant effect on their performance. Characterizing efficient topology is a field of promising research in Artificial Intelligence. However, it is not a trivial task and it is mainly experimented on through convolutional neural networks. We propose a hybrid model which combines the tensor calculus of feed-forward neural networks with Pseudo-Darwinian mechanisms. This allows for finding topologies that are well adapted for elaboration of strategies, control problems or pattern recognition tasks. In particular, the model can provide adapted topologies at early evolutionary stages, and 'structural convergence', which can find applications in robotics, big data and artificial life.  ( 2 min )
    Topology-aware Graph Neural Networks for Learning Feasible and Adaptive ac-OPF Solutions. (arXiv:2205.10129v1 [eess.SY])
    Solving the optimal power flow (OPF) problem is a fundamental task to ensure the system efficiency and reliability in real-time electricity grid operations. We develop a new topology-informed graph neural network (GNN) approach for predicting the optimal solutions of the real-time ac-OPF problem. To incorporate grid topology into the NN model, the proposed GNN-for-OPF framework innovatively exploits the locality property of locational marginal prices and voltage magnitude. Furthermore, we develop a physics-aware (ac-)flow feasibility regularization approach for general OPF learning. The advantages of our proposed designs include reduced model complexity, improved generalizability and feasibility guarantees. By providing the analytical understanding on the graph subspace stability under grid topology contingency, we show the proposed GNN can quickly adapt to varying grid topology by an efficient re-training strategy. Numerical tests on various test systems of different sizes have validated the prediction accuracy, improved flow feasibility, and topology adaptivity capability of our proposed GNN-based learning framework.  ( 2 min )
    Bounding the Effects of Continuous Treatments for Hidden Confounders. (arXiv:2204.11206v2 [stat.ME] UPDATED)
    Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders.
    Sparse Infinite Random Feature Latent Variable Modeling. (arXiv:2205.09909v1 [stat.ML])
    We propose a non-linear, Bayesian non-parametric latent variable model where the latent space is assumed to be sparse and infinite dimensional a priori using an Indian buffet process prior. A posteriori, the number of instantiated dimensions in the latent space is guaranteed to be finite. The purpose of placing the Indian buffet process on the latent variables is to: 1.) Automatically and probabilistically select the number of latent dimensions. 2.) Impose sparsity in the latent space, where the Indian buffet process will select which elements are exactly zero. Our proposed model allows for sparse, non-linear latent variable modeling where the number of latent dimensions is selected automatically. Inference is made tractable using the random Fourier approximation and we can easily implement posterior inference through Markov chain Monte Carlo sampling. This approach is amenable to many observation models beyond the Gaussian setting. We demonstrate the utility of our method on a variety of synthetic, biological and text datasets and show that we can obtain superior test set performance compared to previous latent variable models.
    RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk. (arXiv:2205.10004v1 [cs.LG])
    Failures and anomalies in large-scale software systems are unavoidable incidents. When an issue is detected, operators need to quickly and correctly identify its location to facilitate a swift repair. In this work, we consider the problem of identifying the root cause set that best explains an anomaly in multi-dimensional time series with categorical attributes. The huge search space is the main challenge, even for a small number of attributes and small value sets, the number of theoretical combinations is too large to brute force. Previous approaches have thus focused on reducing the search space, but they all suffer from various issues, requiring extensive manual parameter tuning, being too slow and thus impractical, or being incapable of finding more complex root causes. We propose RiskLoc to solve the problem of multidimensional root cause localization. RiskLoc applies a 2-way partitioning scheme and assigns element weights that linearly increase with the distance from the partitioning point. A risk score is assigned to each element that integrates two factors, 1) its weighted proportion within the abnormal partition, and 2) the relative change in the deviation score adjusted for the ripple effect property. Extensive experiments on multiple datasets verify the effectiveness and efficiency of RiskLoc, and for a comprehensive evaluation, we introduce three synthetically generated datasets that complement existing datasets. We demonstrate that RiskLoc consistently outperforms state-of-the-art baselines, especially in more challenging root cause scenarios, with gains in F1-score up to 57% over the second-best approach with comparable running times.
    Sigmoidally Preconditioned Off-policy Learning:a new exploration method for reinforcement learning. (arXiv:2205.10047v1 [cs.LG])
    One of the major difficulties of reinforcement learning is learning from off-policy samples, which are collected by a different policy (behavior policy) from the one the algorithm evaluates (the target policy). Off-policy learning needs to correct the distribution of the samples from the behavior policy towards that of the target policy. Unfortunately, importance sampling has an inherent high-variance issue which leads to poor gradient estimation in policy gradient methods. We focus on an off-policy Actor-Critic architecture, and propose a novel method, called Preconditioned Proximal Policy Optimization (P3O), which can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective. This preconditioning uses the sigmoid function in a special way such that when there is no policy change, the gradient is maximal and hence the policy gradient will drive a big parameter update for efficient exploration of the parameter space. This is a novel exploration method that has not been studied before, given that existing exploration methods are based on the novelty of states and actions. We compare with several best-performing algorithms on both discrete and continuous tasks and the results confirm that P3O is more off-policy than PPO according to the "off-policyness" measured by the DEON metric, and P3O explores a larger policy space than PPO. Results also show that our P3O maximizes the CPI objective better than PPO during the training process.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v3 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Optimal Parameter-free Online Learning with Switching Cost. (arXiv:2205.06846v2 [cs.LG] UPDATED)
    Parameter-freeness in online learning refers to the adaptivity of an algorithm with respect to the optimal decision in hindsight. In this paper, we design such algorithms in the presence of switching cost - the latter penalizes the optimistic updates required by parameter-freeness, leading to a delicate design trade-off. Based on a novel dual space scaling strategy, we propose a simple yet powerful algorithm for Online Linear Optimization (OLO) with switching cost, which improves the existing suboptimal regret bound [ZCP22a] to the optimal rate. The obtained benefit is extended to the expert setting, and the practicality of our algorithm is demonstrated through a sequential investment task.
    A toolbox for idea generation and evaluation: Machine learning, data-driven, and contest-driven approaches to support idea generation. (arXiv:2205.09840v1 [cs.LG])
    The significance and abundance of data are increasing due to the growing digital data generated from social media, sensors, scholarly literature, patents, different forms of documents published online, databases, product manuals, etc. Various data sources can be used to generate ideas, yet, in addition to bias, the size of the available digital data is a major challenge when it comes to manual analysis. Hence, human-machine interaction is essential for generating valuable ideas where machine learning and data-driven techniques generate patterns from data and serve human sense-making. However, the use of machine learning and data-driven approaches to generate ideas is a relatively new area. Moreover, it is also possible to stimulate innovation using contest-driven idea generation and evaluation. The results and contributions of this thesis can be viewed as a toolbox of idea-generation techniques, including a list of data-driven and machine learning techniques with corresponding data sources and models to support idea generation. In addition, the results include two models, one method and one framework, to better support data-driven and contest-driven idea generation. The beneficiaries of these artefacts are practitioners in data and knowledge engineering, data mining project managers, and innovation agents. Innovation agents include incubators, contest organizers, consultants, innovation accelerators, and industries. Since the proposed artefacts consist of process models augmented with AI techniques, human-centred AI is a promising area of research that can contribute to the artefacts' further development and promote creativity.
    STaR: Bootstrapping Reasoning With Reasoning. (arXiv:2203.14465v2 [cs.LG] UPDATED)
    Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
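    The loop in the abstract is concrete enough to sketch schematically; generate(), is_correct(), and finetune() below are hypothetical stand-ins for the paper's components, not its actual code:
        # schematic STaR loop as described in the abstract
        def star(model, questions, answers, few_shot, n_iters=5):
            for _ in range(n_iters):
                keep = []
                for q, a in zip(questions, answers):
                    rationale, pred = generate(model, few_shot, q)
                    if not is_correct(pred, a):
                        # rationalization: retry with the gold answer given as a hint
                        rationale, pred = generate(model, few_shot, q, hint=a)
                    if is_correct(pred, a):
                        keep.append((q, rationale, a))
                model = finetune(model, keep)   # fine-tune on rationales that yielded correct answers
            return model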
    Lossless Speedup of Autoregressive Translation with Generalized Aggressive Decoding. (arXiv:2203.16487v4 [cs.CL] UPDATED)
    Different from previous work accelerating translation at the cost of quality loss, we propose Generalized Aggressive Decoding (GAD) -- a novel decoding paradigm for lossless speedup of autoregressive translation, through the collaboration of autoregressive and non-autoregressive translation (NAT) of the Transformer. At each decoding iteration, GAD aggressively decodes a number of tokens with NAT as a draft and then verifies them in the autoregressive manner, where only the tokens that pass the verification are accepted as decoded tokens. GAD can achieve the same results as autoregressive translation but much more efficiently because both NAT drafting and autoregressive verification compute in parallel. We conduct experiments in four standard WMT benchmarks and confirm that the vanilla GAD yields exactly the same results as greedy decoding with an around $3\times$ speedup, and that its variant (GAD++) with an advanced verification strategy not only outperforms the greedy translation and even achieves the comparable translation quality with the beam search result, but also further improves the decoding speed, resulting in an around $5\times$ speedup over autoregressive translation. Moreover, GAD can be easily generalized for lossless speedup of other seq2seq tasks like Abstractive Summarization, and benefit more from stronger computing devices, demonstrating its potential to become a de facto decoding paradigm in the future. Our models and codes are available at https://github.com/hemingkx/GAD.
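    The draft-and-verify iteration described here can be sketched schematically; nat_draft() and ar_greedy_next() are hypothetical stand-ins for the NAT drafter and the autoregressive verifier (accepting the longest matching prefix is what keeps the output identical to greedy decoding):
        # one schematic GAD decoding iteration
        def gad_step(prefix, k=8):
            draft = nat_draft(prefix, k)                # k draft tokens, produced in parallel
            verified = ar_greedy_next(prefix, draft)    # AR model's greedy token at each position
            accepted = []
            for d, v in zip(draft, verified):
                if d != v:
                    accepted.append(v)                  # keep the AR correction, drop the rest
                    break
                accepted.append(d)
            return prefix + accepted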
    Mosaic Zonotope Shadow Matching for Risk-Aware Autonomous Localization in Harsh Urban Environments. (arXiv:2205.10223v1 [cs.AI])
    Risk-aware urban localization with the Global Navigation Satellite System (GNSS) remains an unsolved problem with frequent misdetection of the user's street or side of the street. Significant advances in 3D map-aided GNSS use grid-based GNSS shadow matching alongside AI-driven line-of-sight (LOS) classifiers and server-based processing to improve localization accuracy, especially in the cross-street direction. Our prior work introduces a new paradigm for shadow matching that proposes set-valued localization with computationally efficient zonotope set representations. While existing literature improved accuracy and efficiency, the current state of shadow matching theory does not address the needs of risk-aware autonomous systems. We extend our prior work to propose Mosaic Zonotope Shadow Matching (MZSM) that employs a classifier-agnostic polytope mosaic architecture to provide risk-awareness and certifiable guarantees on urban positioning. We formulate a recursively expanding binary tree that refines an initial location estimate with set operations into smaller polytopes. Together, the smaller polytopes form a mosaic. We weight the tree branches with the probability that the user is in line of sight of the satellite and expand the tree with each new satellite observation. Our method yields an exact shadow matching distribution from which we guarantee uncertainty bounds on the user localization. We perform high-fidelity simulations using a 3D building map of San Francisco to validate our algorithm's risk-aware improvements. We demonstrate that MZSM provides certifiable guarantees across varied data-driven LOS classifier accuracies and yields a more precise understanding of the uncertainty over existing methods. We validate that our tree-based construction is efficient and tractable, computing a mosaic from 14 satellites in 0.63 seconds and growing quadratically in the satellite number.
    An alternative proof of the vulnerability of retrieval in high intrinsic dimensionality neighborhood. (arXiv:2010.00990v2 [cs.LG] UPDATED)
    This paper investigates the vulnerability of nearest neighbor search, which is a pivotal tool in data analysis and machine learning. The vulnerability is gauged as the relative amount of perturbation that an attacker needs to add to a dataset point in order to modify its neighbor rank w.r.t. a query. The statistical distribution of this quantity is derived from simple assumptions. Experiments on six large-scale datasets validate this model up to some outliers, which are explained in terms of violations of the assumptions.
    Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits. (arXiv:2205.09899v1 [stat.ML])
    We prove an instance-independent (poly) logarithmic regret for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\Omega(\sqrt{T})$ was shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly) logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incur a regret that scales with $\polylog(T)$.
    Is explainable AI a race against model complexity?. (arXiv:2205.10119v1 [cs.AI])
    Explaining the behaviour of intelligent systems will get increasingly and perhaps intractably challenging as models grow in size and complexity. We may not be able to expect an explanation for every prediction made by a brain-scale model, nor can we expect explanations to remain objective or apolitical. Our functionalist understanding of these models is of less advantage than we might assume. Models precede explanations, and can be useful even when both model and explanation are incorrect. Explainability may never win the race against complexity, but this is less problematic than it seems.
    Nothing makes sense in deep learning, except in the light of evolution. (arXiv:2205.10320v1 [cs.LG])
    Deep Learning (DL) is a surprisingly successful branch of machine learning. The success of DL is usually explained by focusing analysis on a particular recent algorithm and its traits. Instead, we propose that an explanation of the success of DL must look at the population of all algorithms in the field and how they have evolved over time. We argue that cultural evolution is a useful framework to explain the success of DL. In analogy to biology, we use `development' to mean the process converting the pseudocode or text description of an algorithm into a fully trained model. This includes writing the programming code, compiling and running the program, and training the model. If all parts of the process do not align well, then the resultant model will be useless (if the code runs at all!). This is a constraint. A core component of evolutionary developmental biology is the concept of deconstraints -- these are modifications to the developmental process that avoid complete failure by automatically accommodating changes in other components. We suggest that many important innovations in DL, from neural networks themselves to hyperparameter optimization and AutoGrad, can be seen as developmental deconstraints. These deconstraints can be very helpful to both the particular algorithm in how it handles challenges in implementation and the overall field of DL in how easy it is for new ideas to be generated. We highlight how our perspective can both advance DL and lead to new insights for evolutionary biology.
    On the Representation Collapse of Sparse Mixture of Experts. (arXiv:2204.09179v2 [cs.CL] UPDATED)
    Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
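    A minimal sketch of the core idea, estimating routing scores on a low-dimensional hypersphere: token representations are projected into a small routing space and L2-normalized, as are the expert embeddings, so routing is driven by bounded cosine similarities rather than unbounded dot products. The dimensions and temperature below are illustrative assumptions, not the paper's settings.

    ```python
    import torch
    import torch.nn.functional as F

    def hypersphere_routing(h, expert_emb, proj, tau=0.07):
        """Route tokens by cosine similarity in a small routing space.
        `proj`, the routing dimension, and `tau` are illustrative assumptions."""
        z = F.normalize(proj(h), dim=-1)        # tokens on the unit hypersphere
        e = F.normalize(expert_emb, dim=-1)     # expert centroids on the sphere
        scores = z @ e.t() / tau                # bounded cosine logits
        top1 = scores.argmax(dim=-1)            # dispatch each token to its best expert
        gate = F.softmax(scores, dim=-1).gather(-1, top1[:, None]).squeeze(-1)
        return top1, gate

    # usage sketch: hidden size 512, routing dim 8, 16 experts, 4 tokens
    proj = torch.nn.Linear(512, 8)
    expert_emb = torch.nn.Parameter(torch.randn(16, 8))
    idx, gate = hypersphere_routing(torch.randn(4, 512), expert_emb, proj)
    ```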
    Towards a Holistic View on Argument Quality Prediction. (arXiv:2205.09803v1 [cs.CL])
    Argumentation is one of society's foundational pillars, and, sparked by advances in NLP and the vast availability of text data, automated mining of arguments receives increasing attention. A decisive property of arguments is their strength or quality. While there are works on the automated estimation of argument strength, their scope is narrow: they focus on isolated datasets and neglect the interactions with related argument mining tasks, such as argument identification, evidence detection, or emotional appeal. In this work, we close this gap by approaching argument quality estimation from multiple different angles: Grounded on rich results from thorough empirical evaluations, we assess the generalization capabilities of argument quality estimation across diverse domains, the interplay with related argument mining tasks, and the impact of emotions on perceived argument strength. We find that generalization depends on a sufficient representation of different domains in the training data. In zero-shot transfer and multi-task experiments, we reveal that argument quality is among the more challenging tasks but can improve others. Finally, we show that emotions play a lesser role in argument quality than is often assumed.
    A Survey of Trustworthy Graph Learning: Reliability, Explainability, and Privacy Protection. (arXiv:2205.10014v1 [cs.LG])
    Deep graph learning has achieved remarkable progress in both business and scientific areas ranging from finance and e-commerce, to drug and advanced material discovery. Despite this progress, how to ensure various deep graph learning algorithms behave in a socially responsible manner and meet regulatory compliance requirements has become an emerging problem, especially in risk-sensitive domains. Trustworthy graph learning (TwGL) aims to solve the above problems from a technical viewpoint. In contrast to conventional graph learning research which mainly cares about model performance, TwGL considers various reliability and safety aspects of the graph learning framework including but not limited to robustness, explainability, and privacy. In this survey, we provide a comprehensive review of recent leading approaches in the TwGL field from three dimensions, namely, reliability, explainability, and privacy protection. We give a general categorization for existing work and review typical work for each category. To give further insights for TwGL research, we provide a unified view to inspect previous works and build the connection between them. We also point out some important open problems remaining to be solved in the future developments of TwGL.
    Real Time Multi-Object Detection for Helmet Safety. (arXiv:2205.09878v1 [cs.CV])
    The National Football League and Amazon Web Services teamed up to develop a sports injury surveillance and mitigation program via a Kaggle competition, through which the NFL aims to assign specific players to each helmet in order to accurately identify each player's "exposures" throughout a football play. We implement computer vision-based ML algorithms capable of assigning detected helmet impacts to the correct players via tracking information. Our paper explains the approach to automatically track player helmets and their collisions, which also allows the league to review previous plays and explore trends in exposure over time.
    On Tackling Explanation Redundancy in Decision Trees. (arXiv:2205.09971v1 [cs.AI])
    Decision trees (DTs) epitomize the ideal of interpretability of machine learning (ML) models. The interpretability of decision trees motivates explainability approaches by so-called intrinsic interpretability, and it is at the core of recent proposals for applying interpretable ML models in high-risk applications. The belief in DT interpretability is justified by the fact that explanations for DT predictions are generally expected to be succinct. Indeed, in the case of DTs, explanations correspond to DT paths. Since decision trees are ideally shallow, and so paths contain far fewer features than the total number of features, explanations in DTs are expected to be succinct, and hence interpretable. This paper offers both theoretical and experimental arguments demonstrating that, if interpretability of decision trees equates with succinctness of explanations, then decision trees ought not to be deemed interpretable. The paper introduces logically rigorous path explanations and path explanation redundancy, and proves that there exist functions for which decision trees must exhibit paths with arbitrarily large explanation redundancy. The paper also proves that only a very restricted class of functions can be represented with DTs that exhibit no explanation redundancy. In addition, the paper includes experimental results substantiating that path explanation redundancy is observed ubiquitously, not only in decision trees obtained using different tree learning algorithms, but also in a wide range of publicly available decision trees. The paper also proposes polynomial-time algorithms for eliminating path explanation redundancy, which in practice require negligible time to compute. Thus, these algorithms serve to indirectly attain irreducible, and so succinct, explanations for decision trees.
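    To make the notion of path explanation redundancy concrete, here is a conservative Python sketch for scikit-learn trees: a path literal is dropped only if every leaf still reachable under the remaining literals predicts the same class. The sketch matches literals to nodes by exact (feature, threshold) pairs rather than full interval reasoning, so it is sound but weaker than the paper's polynomial-time algorithms.

    ```python
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    def path_literals(clf, x):
        """Collect (feature, threshold, went_left) literals on x's root-to-leaf path."""
        t, node, lits = clf.tree_, 0, []
        while t.children_left[node] != -1:                 # -1 marks a leaf
            f, thr = t.feature[node], t.threshold[node]
            left = bool(x[f] <= thr)
            lits.append((f, thr, left))
            node = t.children_left[node] if left else t.children_right[node]
        return lits, int(t.value[node].argmax())

    def reachable_classes(clf, lits):
        """Classes of every leaf still reachable when only `lits` are enforced."""
        t, out, stack = clf.tree_, set(), [0]
        while stack:
            node = stack.pop()
            if t.children_left[node] == -1:
                out.add(int(t.value[node].argmax()))
                continue
            forced = [left for (f, thr, left) in lits
                      if f == t.feature[node] and thr == t.threshold[node]]
            if forced:                                     # a kept literal decides this node
                stack.append(t.children_left[node] if forced[0] else t.children_right[node])
            else:                                          # otherwise explore both branches
                stack += [t.children_left[node], t.children_right[node]]
        return out

    def prune_redundant(clf, x):
        """Greedily drop literals whose removal provably keeps the prediction."""
        lits, pred = path_literals(clf, x)
        kept = list(lits)
        for lit in lits:
            trial = [l for l in kept if l != lit]
            if reachable_classes(clf, trial) == {pred}:
                kept = trial
        return kept, pred

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
    print(prune_redundant(clf, X[0]))
    ```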
    Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores. (arXiv:2202.01748v2 [stat.ME] UPDATED)
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
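    The sequential sorting idea can be sketched in a few lines. Note the scoring rule below (append the node with the smallest residual variance) is a simple stand-in that is valid for equal-variance Gaussian errors; the paper's likelihood ratio scores are what handle general error distributions.

    ```python
    import numpy as np

    def sequential_order(X):
        """Greedy topological ordering sketch: at each step, regress every
        unordered variable on the variables ordered so far and append the one
        with the smallest residual variance (an equal-variance-Gaussian stand-in
        for the paper's likelihood ratio score)."""
        n, p = X.shape
        X = X - X.mean(0)
        order, rest = [], list(range(p))
        while rest:
            resid_var = {}
            for j in rest:
                if order:
                    Z = X[:, order]
                    beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
                    r = X[:, j] - Z @ beta
                else:
                    r = X[:, j]
                resid_var[j] = r.var()
            nxt = min(resid_var, key=resid_var.get)   # most "exogenous" next node
            order.append(nxt)
            rest.remove(nxt)
        return order
    ```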
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v3 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
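    The non-private skeleton of importance weighting for synthetic data looks as follows: a probabilistic classifier distinguishing real from synthetic records yields an odds ratio that estimates the density ratio, which then re-weights downstream estimators. The privatisation of the likelihood ratios, which is the paper's contribution, is omitted here.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def importance_weights(real_X, synth_X, clip=20.0):
        """Estimate w(x) ~ p_real(x) / p_synth(x) via a real-vs-synthetic
        classifier's odds ratio (non-private sketch)."""
        X = np.vstack([real_X, synth_X])
        y = np.r_[np.ones(len(real_X)), np.zeros(len(synth_X))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        p = clf.predict_proba(synth_X)[:, 1]
        w = p / (1 - p + 1e-12)            # odds ratio as density ratio estimate
        return np.clip(w, 0, clip)         # clipping tames extreme weights

    # downstream estimators then use, e.g., np.average(stat(synth_X), weights=w)
    ```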
    Permutation Predictions for Non-Clairvoyant Scheduling. (arXiv:2202.10199v2 [cs.DS] UPDATED)
    In non-clairvoyant scheduling, the task is to find an online strategy for scheduling jobs with a priori unknown processing requirements with the objective of minimizing the total (weighted) completion time. We revisit this well-studied problem in a recently popular learning-augmented setting that integrates (untrusted) predictions in online algorithm design. While previous works used predictions on processing requirements, we propose a new prediction model, which provides a relative order of jobs that could be seen as predicting algorithmic actions rather than parts of the unknown input. We show that these predictions have desired properties, admit a natural error measure as well as algorithms with strong performance guarantees, and that they are learnable in both theory and practice. We generalize the algorithmic framework proposed in the seminal paper by Kumar et al. (NeurIPS'18) and present the first learning-augmented scheduling results for weighted jobs and unrelated machines. In empirical experiments, we demonstrate the practicability of our approach and its superior performance compared to the previously suggested single-machine algorithms.
    Function Regression using Spiking DeepONet. (arXiv:2205.10130v1 [cs.NE])
    One of the main broad applications of deep learning is function regression. However, despite their demonstrated accuracy and robustness, modern neural network architectures require heavy computational resources to train. One method to mitigate or even resolve this inefficiency has been to draw further inspiration from the brain and reformulate the learning process in a more biologically-plausible way, developing what are known as Spiking Neural Networks (SNNs), which have been gaining traction in recent years. In this paper we present an SNN-based method to perform regression, which has been a challenge due to the inherent difficulty in representing a function's input domain and continuous output values as spikes. We use a DeepONet, a neural network designed to learn operators, to learn the behavior of spikes. Then, we use this approach to perform function regression. We propose several methods to use a DeepONet in the spiking framework, and present accuracy and training time for different benchmarks.
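    For readers unfamiliar with DeepONets, the following PyTorch sketch shows the basic architecture the paper builds on: a branch net encodes the input function sampled at m sensor locations, a trunk net encodes query coordinates, and their inner product gives the operator output. Widths and sizes are illustrative assumptions; the spiking components of the paper are not shown.

    ```python
    import torch
    import torch.nn as nn

    class DeepONet(nn.Module):
        """Minimal DeepONet: G(u)(y) ~ sum_k b_k(u) * t_k(y)."""
        def __init__(self, m, width=64, k=32):
            super().__init__()
            self.branch = nn.Sequential(nn.Linear(m, width), nn.Tanh(), nn.Linear(width, k))
            self.trunk = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, k))

        def forward(self, u, y):
            b = self.branch(u)     # (batch, k): encodes the input function samples
            t = self.trunk(y)      # (n_query, k): encodes query locations
            return b @ t.t()       # (batch, n_query): operator output at the queries

    # toy usage (untrained forward pass): 16 input functions on 50 sensors,
    # evaluated at 50 query points in [0, 1]
    m = 50
    net = DeepONet(m)
    u = torch.randn(16, m)
    y = torch.linspace(0, 1, 50)[:, None]
    out = net(u, y)                # shape (16, 50)
    ```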
    Millimeter Wave Localization with Imperfect Training Data using Shallow Neural Networks. (arXiv:2112.05008v2 [cs.NI] UPDATED)
    Millimeter wave (mmWave) localization algorithms exploit the quasi-optical propagation of mmWave signals, which yields sparse angular spectra at the receiver. Geometric approaches to angle-based localization typically require knowledge of the map of the environment and the location of the access points. Thus, several works have resorted to automated learning in order to infer a device's location from the properties of the received mmWave signals. However, collecting training data for such models is a significant burden. In this work, we propose a shallow neural network model to localize mmWave devices indoors. This model requires significantly fewer weights than those proposed in the literature. Therefore, it is amenable to implementation in resource-constrained hardware, and needs fewer training samples to converge. We also propose to relieve training data collection efforts by retrieving (inherently imperfect) location estimates from geometry-based mmWave localization algorithms. Even in this case, our results show that the proposed neural networks perform as well as or better than state-of-the-art algorithms.
    Beyond Labels: Visual Representations for Bone Marrow Cell Morphology Recognition. (arXiv:2205.09880v1 [cs.CV])
    Analyzing and inspecting bone marrow cell cytomorphology is a critical but highly complex and time-consuming component of hematopathology diagnosis. Recent advancements in artificial intelligence have paved the way for the application of deep learning algorithms to complex medical tasks. Nevertheless, there are many challenges in applying effective learning algorithms to medical image analysis, such as the lack of sufficient and reliably annotated training datasets and the highly class-imbalanced nature of most medical data. Here, we improve on the state-of-the-art methodologies of bone marrow cell recognition by deviating from sole reliance on labeled data and leveraging self-supervision in training our learning models. We investigate our approach's effectiveness in identifying bone marrow cell types. Our experiments demonstrate significant performance improvements in conducting different bone marrow cell recognition tasks compared to the current state-of-the-art methodologies.
    Towards efficient feature sharing in MIMO architectures. (arXiv:2205.10139v1 [cs.LG])
    Multi-input multi-output architectures propose to train multiple subnetworks within one base network and then average the subnetwork predictions to benefit from ensembling for free. Despite some relative success, these architectures are wasteful in their use of parameters. Indeed, we highlight in this paper that the learned subnetworks fail to share even generic features, which limits their applicability on smaller mobile and AR/VR devices. We posit this behavior stems from an ill-posed part of the multi-input multi-output framework. To solve this issue, we propose a novel unmixing step in MIMO architectures that allows subnetworks to properly share features. Preliminary experiments on CIFAR-100 show our adjustments allow feature sharing and improve model performance for small architectures.
    WALNUT: A Benchmark on Weakly Supervised Learning for Natural Language Understanding. (arXiv:2108.12603v2 [cs.CL] UPDATED)
    Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been proven valuable when large amounts of labeled data are unavailable or expensive to obtain. Existing works studying weak supervision for NLU mostly either focus on a specific task or simulate weak supervision signals from ground-truth labels. It is thus hard to compare different approaches and evaluate the benefit of weak supervision without access to a unified and systematic benchmark with diverse tasks and real-world weak labeling rules. In this paper, we propose such a benchmark, named WALNUT (semi-WeAkly supervised Learning for Natural language Understanding Testbed), to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks with different types, including document-level and token-level prediction tasks. WALNUT is the first semi-weakly supervised learning benchmark for NLU, where each task contains weak labels generated by multiple real-world weak sources, together with a small set of clean labels. We conduct baseline evaluations on WALNUT to systematically evaluate the effectiveness of various weak supervision methods and model architectures. Our results demonstrate the benefit of weak supervision for low-resource NLU tasks and highlight interesting patterns across tasks. We expect WALNUT to stimulate further research on methodologies to leverage weak supervision more effectively. The benchmark and code for baselines are available at \url{aka.ms/walnut_benchmark}.
    HyBNN and FedHyBNN: (Federated) Hybrid Binary Neural Networks. (arXiv:2205.09839v1 [cs.LG])
    Binary Neural Networks (BNNs), neural networks with weights and activations constrained to -1(0) and +1, are an alternative to deep neural networks offering faster training, lower memory consumption, and lightweight models ideal for use in resource-constrained devices, while being able to utilize the architecture of their deep neural network counterpart. However, the input binarization step used in BNNs causes a severe accuracy loss. In this paper, we introduce a novel hybrid neural network architecture, Hybrid Binary Neural Network (HyBNN), consisting of a task-independent, general, full-precision variational autoencoder with a binary latent space and a task-specific binary neural network, which is able to greatly limit the accuracy loss due to input binarization by using the full-precision variational autoencoder as a feature extractor. We use it to combine the state-of-the-art accuracy of deep neural networks with the much faster training time, quicker test-time inference and power efficiency of binary neural networks. We show that our proposed system is able to very significantly outperform a vanilla binary neural network with input binarization. We also introduce FedHyBNN, a highly communication-efficient federated counterpart to HyBNN and demonstrate that it is able to reach the same accuracy as its non-federated equivalent. We make our source code, experimental parameters and models available at: https://anonymous.4open.science/r/HyBNN.
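    A hedged sketch of the hybrid idea in PyTorch: a full-precision encoder produces a latent code, the code is binarized with a straight-through estimator so the whole pipeline stays trainable, and a compact head classifies from the binary latent. For simplicity this uses a deterministic encoder rather than the paper's variational autoencoder, and the head stands in for the binary network.

    ```python
    import torch
    import torch.nn as nn

    class SignSTE(torch.autograd.Function):
        """Binarize to {-1, +1} with a straight-through gradient estimator."""
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)
        @staticmethod
        def backward(ctx, g):
            (x,) = ctx.saved_tensors
            return g * (x.abs() <= 1).float()   # pass gradients only inside [-1, 1]

    class HyBNNSketch(nn.Module):
        """Full-precision feature extractor -> binary latent -> compact head.
        (Deterministic stand-in for the paper's VAE-plus-binary-network design.)"""
        def __init__(self, d_in=784, d_lat=128, n_cls=10):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_lat))
            self.head = nn.Linear(d_lat, n_cls)

        def forward(self, x):
            z = SignSTE.apply(self.enc(x))      # binary latent code
            return self.head(z)
    ```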
    MaskGAE: Masked Graph Modeling Meets Graph Autoencoders. (arXiv:2205.10053v1 [cs.LG])
    We present masked graph autoencoder (MaskGAE), a self-supervised learning framework for graph-structured data. Different from previous graph autoencoders (GAEs), MaskGAE adopts masked graph modeling (MGM) as a principled pretext task: masking a portion of edges and attempting to reconstruct the missing part with the partially visible, unmasked graph structure. To understand whether MGM can help GAEs learn better representations, we provide both theoretical and empirical evidence to justify the benefits of this pretext task. Theoretically, we establish the connections between GAEs and contrastive learning, showing that MGM significantly improves the self-supervised learning scheme of GAEs. Empirically, we conduct extensive experiments on a number of benchmark datasets, demonstrating the superiority of MaskGAE over several state-of-the-art methods on both link prediction and node classification tasks. Our code is publicly available at \url{https://github.com/EdisonLeeeee/MaskGAE}.
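    The MGM pretext task can be sketched compactly. Below, `A` is a dense adjacency matrix and `encoder(X, A)` is any GNN returning node embeddings (both assumptions of this sketch, not MaskGAE's actual interface): a fraction of edges is masked out, the encoder sees only the visible structure, and a dot-product decoder is trained to score the held-out edges against random non-edges.

    ```python
    import torch
    import torch.nn as nn

    def masked_gae_loss(A, X, encoder, mask_ratio=0.3):
        """One training step of the masked-graph-modeling pretext (sketch)."""
        src, dst = A.triu(1).nonzero(as_tuple=True)     # undirected edge list
        n_mask = int(mask_ratio * len(src))
        perm = torch.randperm(len(src))
        m, v = perm[:n_mask], perm[n_mask:]             # masked vs visible edges
        A_vis = torch.zeros_like(A)
        A_vis[src[v], dst[v]] = 1.0
        A_vis[dst[v], src[v]] = 1.0
        Z = encoder(X, A_vis)                           # embeddings from visible graph only
        pos = (Z[src[m]] * Z[dst[m]]).sum(-1)           # scores for held-out edges
        ns, nd = torch.randint(len(A), (2, n_mask))     # random pairs as negatives
        neg = (Z[ns] * Z[nd]).sum(-1)
        logits = torch.cat([pos, neg])
        labels = torch.cat([torch.ones(n_mask), torch.zeros(n_mask)])
        return nn.functional.binary_cross_entropy_with_logits(logits, labels)
    ```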
    Stochastic resonance neurons in artificial neural networks. (arXiv:2205.10122v1 [cs.NE])
    Many modern applications of artificial neural networks involve large numbers of layers, making traditional digital implementations increasingly complex. Optical neural networks offer parallel processing at high bandwidth, but face the challenge of noise accumulation. We propose here a new type of neural network that uses stochastic resonance as an inherent part of the architecture, and demonstrate a possibility of significantly reducing the required number of neurons for a given performance accuracy. We also show that such a neural network is more robust against the impact of noise.
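    Stochastic resonance itself is easy to demonstrate: a subthreshold signal passed through a hard threshold becomes detectable only at intermediate noise levels. The following self-contained demo (illustrative parameters) prints the signal/output correlation as the noise level is swept; it peaks at moderate noise and degrades at both extremes.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 10, 5000)
    signal = 0.3 * np.sin(2 * np.pi * t)       # subthreshold: never crosses 1.0 alone
    for sigma in [0.4, 0.7, 1.2, 3.0]:
        out = (signal + sigma * rng.standard_normal(t.size)) > 1.0
        corr = np.corrcoef(signal, out.astype(float))[0, 1]
        print(f"noise sigma={sigma:.1f}  signal/output correlation={corr:.3f}")
    ```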
    A Rule Search Framework for the Early Identification of Chronic Emergency Homeless Shelter Clients. (arXiv:2205.09883v1 [cs.CY])
    This paper uses rule search techniques for the early identification of emergency homeless shelter clients who are at risk of becoming long term or chronic shelter users. Using a data set from a major North American shelter containing 12 years of service interactions with over 40,000 individuals, the optimized pruning for unordered search (OPUS) algorithm is used to develop rules that are both intuitive and effective. The rules are evaluated within a framework compatible with the real-time delivery of a housing program meant to transition high risk clients to supportive housing. Results demonstrate that the median time to identification of clients at risk of chronic shelter use drops from 297 days to 162 days when the methods in this paper are applied.
    On the Prediction Instability of Graph Neural Networks. (arXiv:2205.10070v1 [cs.LG])
    Instability of trained models, i.e., the dependence of individual node predictions on random factors, can affect reproducibility, reliability, and trust in machine learning systems. In this paper, we systematically assess the prediction instability of node classification with state-of-the-art Graph Neural Networks (GNNs). With our experiments, we establish that multiple instantiations of popular GNN models trained on the same data with the same model hyperparameters result in almost identical aggregated performance but display substantial disagreement in the predictions for individual nodes. We find that up to one third of the incorrectly classified nodes differ across algorithm runs. We identify correlations between hyperparameters, node properties, and the size of the training set with the stability of predictions. In general, maximizing model performance implicitly also reduces model instability.
    Self-Paced Multi-Agent Reinforcement Learning. (arXiv:2205.10016v1 [cs.AI])
    Curriculum reinforcement learning (CRL) aims to speed up learning of a task by gradually changing the difficulty of the task from easy to hard through control of factors such as initial state or environment dynamics. While automating CRL is well studied in the single-agent setting, in multi-agent reinforcement learning (MARL) it remains an open question whether controlling the number of agents alongside other factors in a principled manner is beneficial; prior approaches typically rely on hand-crafted heuristics. In addition, how tasks evolve as the number of agents changes remains understudied, which is critical for scaling to more challenging tasks. We introduce self-paced MARL (SPMARL), which enables optimizing the number of agents together with other environment factors in a principled way, and show that common assumptions, such as that fewer agents always make the task easier, are not generally valid. The curriculum induced by SPMARL reveals the evolution of tasks w.r.t. the number of agents, and experiments show that SPMARL improves performance when the number of agents sufficiently influences task difficulty.
    HCMD-zero: Learning Value Aligned Mechanisms from Data. (arXiv:2202.10122v2 [cs.MA] UPDATED)
    Artificial learning agents are mediating a larger and larger number of interactions among humans, firms, and organizations, and the intersection between mechanism design and machine learning has been heavily investigated in recent years. However, mechanism design methods often make strong assumptions on how participants behave (e.g. rationality), on the kind of knowledge designers have access to a priori (e.g. access to strong baseline mechanisms), or on what the goal of the mechanism should be (e.g. total welfare). Here we introduce HCMD-zero, a general-purpose method to construct mechanisms making none of these three assumptions. HCMD-zero learns to mediate interactions among participants and adjusts the mechanism parameters to make itself more likely to be preferred by participants. It does so by remaining engaged in an electoral contest with copies of itself, thereby accessing direct feedback from participants. We test our method on a stylized resource allocation game that highlights the tension between productivity, equality and the temptation to free ride. HCMD-zero produces a mechanism that is preferred by human participants over a strong baseline; it does so automatically, without requiring prior knowledge, and using human behavioral trajectories sparingly and effectively. Our analysis shows HCMD-zero consistently makes the mechanism policy more and more likely to be preferred by human participants over the course of training, and that it results in a mechanism with an interpretable and intuitive policy.
    How to Minimize the Weighted Sum AoI in Multi-Source Status Update Systems: OMA or NOMA?. (arXiv:2205.03143v2 [cs.IT] UPDATED)
    In this paper, the minimization of the weighted sum average age of information (AoI) in a multi-source status update communication system is studied. Multiple independent sources send update packets to a common destination node in a time-slotted manner under the limit of maximum retransmission rounds. Different multiple access schemes, i.e., orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA), are exploited here over a block-fading multiple access channel (MAC). Constrained Markov decision process (CMDP) problems are formulated to describe the AoI minimization problems considering both transmission schemes. The Lagrangian method is utilised to convert CMDP problems to unconstrained Markov decision process (MDP) problems, and corresponding algorithms to derive the power allocation policies are obtained. On the other hand, for the case of unknown environments, two online reinforcement learning approaches considering both multiple access schemes are proposed to achieve near-optimal age performance. Numerical simulations validate the improvement of the proposed policy in terms of weighted sum AoI compared to the fixed power transmission policy, and illustrate that NOMA is more favorable in the case of larger packet sizes.
    Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence. (arXiv:2204.11689v1 [physics.plasm-ph] CROSS LISTED)
    We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model are found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with an observed broadening in the distribution of turbulent field amplitudes and increased ${\bf E \times B}$ shearing rates.
    DDDM: a Brain-Inspired Framework for Robust Classification. (arXiv:2205.10117v1 [cs.NE])
    Despite their outstanding performance in a broad spectrum of real-world tasks, deep artificial neural networks are sensitive to input noise, particularly adversarial perturbations. By contrast, human and animal brains are much less vulnerable. In contrast to the one-shot inference performed by most deep neural networks, the brain often solves decision-making with an evidence accumulation mechanism that may trade time for accuracy when facing noisy inputs. The mechanism is well described by the Drift-Diffusion Model (DDM). In the DDM, decision-making is modeled as a process in which noisy evidence is accumulated toward a threshold. Drawing inspiration from the DDM, we propose the Dropout-based Drift-Diffusion Model (DDDM) that combines test-phase dropout and the DDM for improving the robustness of arbitrary neural networks. The dropouts create temporally uncorrelated noise in the network that counters perturbations, while the evidence accumulation mechanism guarantees a reasonable decision accuracy. Neural networks enhanced with the DDDM tested in image, speech, and text classification tasks all significantly outperform their native counterparts, demonstrating the DDDM as a task-agnostic defense against adversarial attacks.
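    A minimal sketch of the inference mechanism, under the assumption that the exact DDDM details (drift normalization, threshold calibration) differ in the paper: dropout is kept active at test time, per-class log-evidence is accumulated over stochastic forward passes, and a decision is emitted once the margin between the top two classes crosses a threshold.

    ```python
    import torch

    @torch.no_grad()
    def dddm_predict(model, x, n_cls, threshold=5.0, max_steps=100):
        """Evidence-accumulation inference with test-phase dropout (sketch).
        Assumes `model` uses dropout (train() keeps it stochastic) and that
        model(x) returns (1, n_cls) logits."""
        model.train()                          # keep dropout layers stochastic
        evidence = torch.zeros(n_cls)
        for step in range(max_steps):
            logp = torch.log_softmax(model(x), dim=-1).squeeze(0)
            evidence += logp                   # noisy accumulation (the "drift")
            top2 = evidence.topk(2).values
            if top2[0] - top2[1] >= threshold: # decision boundary reached
                break
        return evidence.argmax().item(), step + 1   # class and passes used
    ```

    Harder (noisier) inputs thus automatically consume more forward passes, trading time for accuracy as in the biological model.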
    Classifying Human Activities using Machine Learning and Deep Learning Techniques. (arXiv:2205.10325v1 [cs.LG])
    Human Activity Recognition (HAR) describes a machine's ability to recognize human actions. Nowadays, many people are health conscious and interested in tracking their daily activities using smartphones or smart watches, which can help them manage their daily routines in a healthy way. With this objective, Kaggle conducted a competition to classify 6 different human activities based on the inertial signals obtained from 30 volunteers' smartphones. The main challenge in HAR is to overcome the difficulty of separating human activities based on the given data such that no two activities overlap. In this experimentation, data visualization is first performed on expert-generated features with the help of t-distributed Stochastic Neighbor Embedding (t-SNE), followed by applying various machine learning techniques like Logistic Regression, Linear SVC, kernel SVM, and decision trees to better classify the 6 distinct human activities. Moreover, deep learning techniques like Long Short-Term Memory (LSTM), Bi-Directional LSTM, Recurrent Neural Network (RNN), and Gated Recurrent Unit (GRU) are trained using raw time-series data. Finally, metrics like accuracy, confusion matrix, precision, and recall are used to evaluate the performance of the machine learning and deep learning models. Experimental results showed that the Linear Support Vector Classifier in machine learning and the Gated Recurrent Unit in deep learning provided better accuracy for human activity recognition compared to other classifiers.
    Hybrid Transfer in Deep Reinforcement Learning for Ads Allocation. (arXiv:2204.11589v2 [cs.IR] UPDATED)
    Ads allocation, which involves allocating ads and organic items to limited slots in feed with the purpose of maximizing platform revenue, has become a research hotspot. Note that e-commerce platforms usually have multiple entrances for different categories, and some entrances have few visits. Data from these entrances has low coverage, which makes it difficult for the agent to learn. To address this challenge, we propose Similarity-based Hybrid Transfer for Ads Allocation (SHTAA), which effectively transfers samples as well as knowledge from data-rich entrances to data-poor entrances. Specifically, we define an uncertainty-aware MDP similarity to estimate the similarity between the MDPs of different entrances. Based on this similarity, we design a hybrid transfer method, including instance transfer and strategy transfer, to efficiently transfer samples and knowledge from one entrance to another. Both offline and online experiments on the Meituan food delivery platform demonstrate that the proposed method could achieve better performance for data-poor entrances and increase the revenue for the platform.
    Can Foundation Models Wrangle Your Data?. (arXiv:2205.09911v1 [cs.LG])
    Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast three data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and temporal data, and opportunities to make data driven systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
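    As a flavor of what casting a data task as a prompting task means, here is a toy entity-matching prompt builder. The template is illustrative, not the paper's serialization; the returned string would be sent to a foundation model and the yes/no completion parsed as the match decision.

    ```python
    def entity_match_prompt(a: str, b: str) -> str:
        """Serialize an entity-matching pair as a yes/no question (toy template)."""
        return (
            "Product A is '" + a + "'. Product B is '" + b + "'. "
            "Are Product A and Product B the same? Answer yes or no."
        )

    prompt = entity_match_prompt(
        "Apple MacBook Pro 13in 2020, 256GB",
        "MacBook Pro 13-inch (2020) 256 GB SSD",
    )
    # send `prompt` to an FM; a completion starting with "yes" marks a match
    ```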
    Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting. (arXiv:2205.10234v1 [cs.CL])
    Inpatient violence is a common and severe problem within psychiatry. Knowing who might become violent can influence staffing levels and mitigate severity. Predictive machine learning models can assess each patient's likelihood of becoming violent based on clinical notes. Yet, while machine learning models benefit from having more data, data availability is limited as hospitals typically do not share their data for privacy preservation. Federated Learning (FL) can overcome the problem of data limitation by training models in a decentralised manner, without disclosing data between collaborators. However, although several FL approaches exist, none of these train Natural Language Processing models on clinical notes. In this work, we investigate the application of Federated Learning to clinical Natural Language Processing, applied to the task of Violence Risk Assessment by simulating a cross-institutional psychiatric setting. We train and compare four models: two local models, a federated model and a data-centralised model. Our results indicate that the federated model outperforms the local models and has similar performance to the data-centralised model. These findings suggest that Federated Learning can be used successfully in a cross-institutional setting and is a step towards new applications of Federated Learning based on clinical notes.
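    The federated model in such a setup is typically trained with federated averaging, sketched below: each site runs a few local steps on its private data and only model weights travel. The paper's exact protocol and model are not specified here; this sketch assumes a plain float-parameter classifier (e.g., over note embeddings).

    ```python
    import copy
    import torch

    def fedavg_round(global_model, client_loaders, local_steps=10, lr=1e-3):
        """One FedAvg round (sketch): local SGD per site, then a size-weighted
        average of the resulting weights. Raw data never leaves a site."""
        states, sizes = [], []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for (x, y), _ in zip(loader, range(local_steps)):
                opt.zero_grad()
                torch.nn.functional.cross_entropy(local(x), y).backward()
                opt.step()
            states.append(local.state_dict())
            sizes.append(len(loader.dataset))
        total = sum(sizes)
        avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
               for k in states[0]}
        global_model.load_state_dict(avg)
        return global_model
    ```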
    Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations. (arXiv:2203.09590v3 [cs.CL] UPDATED)
    Within the emerging research efforts to combine structured and unstructured knowledge, many approaches incorporate factual knowledge, e.g., available in form of structured knowledge graphs (KGs), into pre-trained language models (PLMs) and then apply the knowledge-enhanced PLMs to downstream NLP tasks. However, (1) they typically only consider \textit{static} factual knowledge, whereas, e.g., knowledge graphs (KGs) also contain \textit{temporal facts} or \textit{events} indicating evolutionary relationships among entities at different timestamps. (2) PLMs cannot be directly applied to many KG tasks, such as temporal KG completion. In this paper, we focus on \textbf{e}nhancing temporal knowledge embeddings with \textbf{co}ntextualized \textbf{la}nguage representations (ECOLA). We align structured knowledge, contained in temporal knowledge graphs, with their textual descriptions extracted from news articles, and propose a novel knowledge-text prediction task to inject the abundant information from descriptions into temporal knowledge embeddings. ECOLA jointly optimizes the knowledge-text prediction objective and the temporal knowledge embeddings, which can simultaneously take full advantage of textual and knowledge information. The proposed fusion method is model-agnostic and can be combined with potentially any temporal KG model. For training ECOLA, we introduce three temporal KG datasets with aligned textual descriptions. Experimental results on the temporal knowledge graph completion task show that ECOLA outperforms state-of-the-art temporal KG models by a large margin. The proposed datasets can serve as new temporal KG benchmarks and facilitate future research on structured and unstructured knowledge integration.
    Learning a Large Neighborhood Search Algorithm for Mixed Integer Programs. (arXiv:2107.10201v3 [math.OC] UPDATED)
    Large Neighborhood Search (LNS) is a combinatorial optimization heuristic that starts with an assignment of values for the variables to be optimized, and iteratively improves it by searching a large neighborhood around the current assignment. In this paper we consider a learning-based LNS approach for mixed integer programs (MIPs). We train a Neural Diving model to represent a probability distribution over assignments, which, together with an off-the-shelf MIP solver, generates an initial assignment. Formulating the subsequent search steps as a Markov Decision Process, we train a Neural Neighborhood Selection policy to select a search neighborhood at each step, which is searched using a MIP solver to find the next assignment. The policy network is trained using imitation learning. We propose a target policy for imitation that, given enough compute resources, is guaranteed to select the neighborhood containing the optimal next assignment amongst all possible choices for the neighborhood of a specified size. Our approach matches or outperforms all the baselines on five real-world MIP datasets with large-scale instances from diverse applications, including two production applications at Google. It achieves $2\times$ to $37.8\times$ better average primal gap than the best baseline on three of the datasets at large running times.
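    Stripped of the learned components, the LNS loop the paper builds on is short. In this sketch, `solve_mip` (a sub-MIP solver call) and `select_neighborhood` (where the learned Neural Neighborhood Selection policy would plug in; a random fallback is given) are hypothetical stand-ins, not a real solver API.

    ```python
    import random

    def lns_loop(mip, init_assignment, select_neighborhood, solve_mip,
                 n_iters=50, k=20):
        """Large Neighborhood Search skeleton (sketch): repeatedly un-fix a
        small set of variables and re-solve the sub-MIP with the rest pinned."""
        x = dict(init_assignment)
        for _ in range(n_iters):
            free = select_neighborhood(mip, x, k)       # variables to re-optimize
            fixed = {v: val for v, val in x.items() if v not in free}
            x = solve_mip(mip, fixed=fixed)             # hypothetical solver call
        return x

    def random_neighborhood(mip, x, k):
        """Baseline policy: un-fix k random variables (the learned policy
        replaces this choice)."""
        return set(random.sample(list(x), k))
    ```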
    Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions. (arXiv:2205.10218v1 [cs.LG])
    Generalization across different environments with the same tasks is critical for successful applications of visual reinforcement learning (RL) in real scenarios. However, visual distractions -- which are common in real scenes -- from high-dimensional observations can be hurtful to the learned representations in visual RL, thus degrading the performance of generalization. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information by learning reward sequence distributions (RSDs), as the reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task -- that is, predicting the characteristic functions of RSDs -- to learn task-relevant representations, because we can well approximate the high-dimensional distributions by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves the performance of generalization on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.
    Explanatory machine learning for sequential human teaching. (arXiv:2205.10250v1 [cs.AI])
    The topic of comprehensibility of machine-learned theories has recently drawn increasing attention. Inductive Logic Programming (ILP) uses logic programming to derive logic theories from small data based on abduction and induction techniques. Learned theories are represented in the form of rules as declarative descriptions of obtained knowledge. In earlier work, the authors provided the first evidence of a measurable increase in human comprehension based on machine-learned logic rules for simple classification tasks. In a later study, it was found that the presentation of machine-learned explanations to humans can produce both beneficial and harmful effects in the context of game learning. We continue our investigation of comprehensibility by examining the effects of the ordering of concept presentations on human comprehension. In this work, we examine the explanatory effects of curriculum order and the presence of machine-learned explanations for sequential problem-solving. We show that 1) there exist tasks A and B such that learning A before B results in better human comprehension than learning B before A, and 2) there exist tasks A and B such that the presence of explanations when learning A contributes to improved human comprehension when subsequently learning B. We propose a framework for the effects of sequential teaching on comprehension based on an existing definition of comprehensibility and provide evidence for support from data collected in human trials. Empirical results show that sequential teaching of concepts with increasing complexity a) has a beneficial effect on human comprehension and b) leads to human re-discovery of divide-and-conquer problem-solving strategies, and c) that studying machine-learned explanations allows adaptations of human problem-solving strategy with better performance.
    A Review of Safe Reinforcement Learning: Methods, Theory and Applications. (arXiv:2205.10330v1 [cs.AI])
    Reinforcement learning has achieved tremendous success in many complex decision-making tasks. When it comes to deploying RL in the real world, safety concerns are usually raised, leading to a growing demand for safe reinforcement learning algorithms, such as in autonomous driving and robotics scenarios. While safety control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future research in this thread, in this paper, we provide a review of safe RL from the perspectives of methods, theory and applications. Firstly, we review the progress of safe RL from five dimensions and come up with five problems that are crucial for deploying safe RL in real-world applications, coined as "2H3W". Secondly, we analyze the theory and algorithm progress from the perspectives of answering the "2H3W" problems. Then, the sample complexity of safe RL methods is reviewed and discussed, followed by an introduction of the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire more future research on this thread. To advance the study of safe RL algorithms, we release a benchmark suite, an open-sourced repository containing the implementations of major safe RL algorithms, along with tutorials at the link: https://github.com/chauncygu/Safe-Reinforcement-Learning-Baselines.git.
    Automated Scoring for Reading Comprehension via In-context BERT Tuning. (arXiv:2205.09864v1 [cs.LG])
    Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two limitations: 1) they fail to leverage item linkage for scenarios such as reading comprehension where multiple items may share a reading passage; 2) they are not scalable since storing one model per item becomes difficult when models have a large number of parameters. In this paper, we report our (grand prize-winning) solution to the National Assessment of Education Progress (NAEP) automated scoring challenge for reading comprehension. Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items with a carefully-designed input structure to provide contextual information on each item. We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge. We also discuss the biases, common error types, and limitations of our approach.
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v1 [cs.LG])
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple \emph{post hoc} method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    Algorithms for Weak Optimal Transport with an Application to Economics. (arXiv:2205.09825v1 [stat.ML])
    The theory of weak optimal transport (WOT), introduced by [Gozlan et al., 2017], generalizes the classic Monge-Kantorovich framework by allowing the transport cost between one point and the points it is matched with to be nonlinear. In the so-called barycentric version of WOT, the cost for transporting a point $x$ only depends on $x$ and on the barycenter of the points it is matched with. This aggregation property of WOT is appealing in machine learning, economics and finance. Yet algorithms to compute WOT have only been developed for the special case of quadratic barycentric WOT, or depend on neural networks with no guarantee on the computed value and matching. The main difficulty lies in the transportation constraints, which are costly to project onto. In this paper, we propose to use mirror descent algorithms to solve the primal and dual versions of the WOT problem. We also apply our algorithms to the variant of WOT introduced by [Chon\'e et al., 2022] where mass is distributed from one space to another through unnormalized kernels (WOTUK). We empirically compare the solutions of WOT and WOTUK with classical OT. We illustrate our numerical methods on the economic framework of [Chon\'e and Kramarz, 2021], namely the matching between workers and firms on labor markets.
    Speeding up PCA with priming. (arXiv:2109.03709v3 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame.
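    The second, exact step of pPCA is cheap because it operates entirely inside the small primed subspace. A numpy sketch (assuming `U0` holds the priming directions from any approximate-PCA method):

    ```python
    import numpy as np

    def primed_pca(X, U0, k):
        """Exact PCA refinement inside the subspace spanned by priming
        directions U0 (d x r, r small), lifted back to the ambient space."""
        Q, _ = np.linalg.qr(U0)                 # orthonormal basis of the primed subspace
        Y = X @ Q                               # project data into the subspace (cheap)
        Yc = Y - Y.mean(0)
        C = Yc.T @ Yc / (len(Y) - 1)            # tiny r x r covariance
        w, V = np.linalg.eigh(C)                # exact eigendecomposition, negligible cost
        top = V[:, np.argsort(w)[::-1][:k]]
        return Q @ top                          # refined principal directions in R^d

    # e.g. U0 from a few passes of Oja's rule or randomized PCA, with k <= U0.shape[1]
    ```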
    Track Boosting and Synthetic Data Aided Drone Detection. (arXiv:2111.12389v5 [cs.CV] UPDATED)
    This paper presents the first-place winning solution of the Drone vs. Bird Challenge, organized at AVSS 2021. As the usage of drones increases with lowered costs and improved drone technology, drone detection emerges as a vital object detection task. However, detecting distant drones under unfavorable conditions, namely weak contrast, long range, and low visibility, requires effective algorithms. Our method approaches the drone detection problem by fine-tuning a YOLOv5 model with real and synthetically generated data, using a Kalman-based object tracker to boost detection confidence. Our results indicate that augmenting the real data with an optimal subset of synthetic data can increase performance. Moreover, temporal information gathered by object tracking methods can increase performance further.
    Robust Multi-Task Learning and Online Refinement for Spacecraft Pose Estimation across Domain Gap. (arXiv:2203.04275v3 [cs.CV] UPDATED)
    This work presents Spacecraft Pose Network v2 (SPNv2), a Convolutional Neural Network (CNN) for pose estimation of noncooperative spacecraft across domain gap. SPNv2 is a multi-scale, multi-task CNN which consists of a shared multi-scale feature encoder and multiple prediction heads that perform different tasks on a shared feature output. These tasks are all related to detection and pose estimation of a target spacecraft from an image, such as prediction of pre-defined satellite keypoints, direct pose regression, and binary segmentation of the satellite foreground. It is shown that by jointly training on different yet related tasks with extensive data augmentations on synthetic images only, the shared encoder learns features that are common across image domains that have fundamentally different visual characteristics compared to synthetic images. This work also introduces Online Domain Refinement (ODR) which refines the parameters of the normalization layers of SPNv2 on the target domain images online at deployment. Specifically, ODR performs self-supervised entropy minimization of the predicted satellite foreground, thereby improving the CNN's performance on the target domain images without their pose labels and with minimal computational efforts. The GitHub repository for SPNv2 is available at \url{https://github.com/tpark94/spnv2}.
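    A sketch of the ODR update in PyTorch, under the assumption that `model(images)` returns per-pixel foreground logits: all weights are frozen except normalization-layer affine parameters, which are nudged to minimize the entropy of the predicted foreground mask; no pose labels are needed.

    ```python
    import torch

    def odr_step(model, images, optimizer):
        """One Online Domain Refinement step (sketch): self-supervised entropy
        minimization of the predicted foreground, updating only norm layers."""
        for p in model.parameters():
            p.requires_grad_(False)             # freeze everything...
        for m in model.modules():
            if isinstance(m, (torch.nn.BatchNorm2d, torch.nn.GroupNorm)):
                for p in m.parameters():
                    p.requires_grad_(True)      # ...except normalization affines
        prob = torch.sigmoid(model(images))     # per-pixel foreground probability
        entropy = -(prob * torch.log(prob + 1e-8)
                    + (1 - prob) * torch.log(1 - prob + 1e-8)).mean()
        optimizer.zero_grad()
        entropy.backward()                      # confident masks <=> low entropy
        optimizer.step()
        return entropy.item()
    ```

    The optimizer here would be built over the normalization parameters only, keeping the per-image computational cost minimal.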
    Content-Context Factorized Representations for Automated Speech Recognition. (arXiv:2205.09872v1 [eess.AS])
    Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may contain not only information about the spoken language content, but also information about unnecessary context, such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
    Delator: Automatic Detection of Money Laundering Evidence on Transaction Graphs via Neural Networks. (arXiv:2205.10293v1 [cs.LG])
    Money laundering is one of the most relevant criminal activities today, due to its potential to cause massive financial losses to governments, banks, etc. We propose DELATOR, a new CAAT (computer-assisted audit technology) to detect money laundering activities based on neural network models that encode bank transfers as a large-scale temporal graph. In collaboration with a Brazilian bank, we design and apply an evaluation strategy to quantify DELATOR's performance on historic data comprising millions of clients. DELATOR outperforms an off-the-shelf solution from Amazon AWS by 18.9% with respect to AUC. We conducted real experiments that led to the discovery of 8 new suspicious cases among the 100 analyzed, which would have been reported to the authorities under the current criteria.
    What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment. (arXiv:2205.10327v1 [stat.ME])
    The fundamental problem of causal inference -- that we never observe counterfactuals -- prevents us from identifying how many might be negatively affected by a proposed intervention. If, in an A/B test, half of users click (or buy, or watch, or renew, etc.), whether exposed to the standard experience A or a new one B, hypothetically it could be because the change affects no one, because the change positively affects half the user population to go from no-click to click while negatively affecting the other half, or something in between. While unknowable, this impact is clearly of material importance to the decision to implement a change or not, whether due to fairness, long-term, systemic, or operational considerations. We therefore derive the tightest-possible (i.e., sharp) bounds on the fraction negatively affected (and other related estimands) given data with only factual observations, whether experimental or observational. Naturally, the more we can stratify individuals by observable covariates, the tighter the sharp bounds. Since these bounds involve unknown functions that must be learned from data, we develop a robust inference algorithm that is efficient almost regardless of how and how fast these functions are learned, remains consistent when some are mislearned, and still gives valid conservative bounds when most are mislearned. Our methodology altogether therefore strongly supports credible conclusions: it avoids spuriously point-identifying this unknowable impact, focusing on the best bounds instead, and it permits exceedingly robust inference on these. We demonstrate our method in simulation studies and in a case study of career counseling for the unemployed.
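    In the covariate-free case, the sharp bounds the abstract alludes to reduce to Frechet-Hoeffding-style limits computable from the two marginal rates alone; the paper's contribution is tightening these by stratifying on covariates and making inference robust to how the needed functions are learned. A tiny illustration:

    ```python
    def harm_bounds(p_a: float, p_b: float):
        """Sharp covariate-free bounds on the fraction negatively affected,
        P(Y(A)=1, Y(B)=0), given only the marginal rates under A and B."""
        lower = max(0.0, p_a - p_b)
        upper = min(p_a, 1.0 - p_b)
        return lower, upper

    # the abstract's A/B example: half click under A and half under B
    print(harm_bounds(0.5, 0.5))   # (0.0, 0.5): from "no one harmed" to "half harmed"
    ```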
    Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN) for Short-Term Forecasting of Transit Passenger Flow. (arXiv:2107.13226v2 [cs.LG] UPDATED)
    Short-term forecasting of passenger flow is critical for transit management and crowd regulation. Spatial dependencies, temporal dependencies, inter-station correlations driven by other latent factors, and exogenous factors all pose challenges for short-term forecasts of passenger flow in urban rail transit networks. To incorporate these complex factors, we propose an innovative deep learning approach, the Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN), to forecast passenger flow in urban rail transit systems. We use multiple graphs to encode the spatial and other heterogeneous inter-station correlations, and the temporal dynamics of these correlations are also modeled via the proposed multi-graph convolutional-recurrent network structure. Inflow and outflow of all stations can be predicted collectively, multiple time steps ahead, via a sequence-to-sequence (seq2seq) architecture. The proposed method is applied to short-term forecasts of passenger flow in the Shenzhen Metro, China. The experimental results show that MGC-RNN outperforms the benchmark algorithms in forecasting accuracy. We also find that inter-station correlations driven by network distance, network structure, and recent flow patterns are significant factors for passenger flow forecasting, and that the LSTM-encoder-decoder architecture captures the temporal dependencies well. In general, the proposed framework provides multiple views of passenger flow dynamics for fine-grained prediction and points toward multi-source heterogeneous data fusion in spatiotemporal forecasting tasks.
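    As a rough illustration of the multi-graph encoding idea (a generic multi-graph GCN step, not necessarily the paper's exact layer), one propagation step could sum a separately parameterized graph convolution per correlation graph before the result is fed to a recurrent cell:
```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, standard in GCNs.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def multi_graph_conv(X, graphs, weights):
    # One GCN propagation per graph, each with its own weight matrix, so
    # distance-, structure-, and flow-based correlations get separate filters.
    return np.maximum(0.0, sum(normalize_adj(A) @ X @ W for A, W in zip(graphs, weights)))

rng = np.random.default_rng(0)
n_stations, in_dim, out_dim = 6, 4, 8
X = rng.normal(size=(n_stations, in_dim))  # station features at one time step
graphs = [rng.integers(0, 2, size=(n_stations, n_stations)) for _ in range(3)]
graphs = [np.triu(A, 1) + np.triu(A, 1).T for A in graphs]  # symmetric, no self-loops
weights = [rng.normal(size=(in_dim, out_dim)) for _ in graphs]
H = multi_graph_conv(X, graphs, weights)   # would feed an RNN cell per time step
print(H.shape)  # (6, 8)
```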
    Adaptor: Objective-Centric Adaptation Framework for Language Models. (arXiv:2203.03989v2 [cs.CL] UPDATED)
    Progress in natural language processing research is catalyzed by widely available software frameworks. This paper introduces the Adaptor library, which transposes the traditional model-centric approach, composed of pre-training and fine-tuning steps, into an objective-centric approach that composes the training process from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation: multitask training, custom objectives development, dynamic training curricula, and domain adaptation. Adaptor aims to ease the reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.
    Capturing cross-session neural population variability through self-supervised identification of consistent neuron ensembles. (arXiv:2205.09829v1 [q-bio.NC])
    Decoding stimuli or behaviour from recorded neural activity is a common approach to interrogate brain function in research, and an essential part of brain-computer and brain-machine interfaces. Reliable decoding even from small neural populations is possible because high-dimensional neural population activity typically occupies low-dimensional manifolds that are discoverable with suitable latent variable models. Over time, however, drifts in the activity of individual neurons and instabilities in neural recording devices can be substantial, making stable decoding over days and weeks impractical. While this drift cannot be predicted at the level of individual neurons, population-level variations over consecutive recording sessions, such as differing sets of recorded neurons and varying permutations of consistent neurons, may be learnable when the underlying manifold is stable over time. Classifying neurons as consistent or unfamiliar across sessions, and accounting for changes in the order in which consistent neurons appear, may then maintain decoding performance. In this work we show that self-supervised training of a deep neural network can compensate for this inter-session variability. As a result, a sequential autoencoding model can maintain state-of-the-art behaviour decoding performance for completely unseen recording sessions several days into the future. Our approach requires only a single recording session for training the model, and is a step towards reliable, recalibration-free brain-computer interfaces.
    Evolving SimGANs to Improve Abnormal Electrocardiogram Classification. (arXiv:2205.10116v1 [cs.NE])
    Machine Learning models are used in a wide variety of domains. However, machine learning methods often require a large amount of data in order to be successful. This is especially troublesome in domains where collecting real-world data is difficult and/or expensive. Data simulators do exist for many of these domains, but they do not sufficiently reflect the real world data due to factors such as a lack of real-world noise. Recently generative adversarial networks (GANs) have been modified to refine simulated image data into data that better fits the real world distribution, using the SimGAN method. While evolutionary computing has been used for GAN evolution, there are currently no frameworks that can evolve a SimGAN. In this paper we (1) extend the SimGAN method to refine one-dimensional data, (2) modify Easy Cartesian Genetic Programming (ezCGP), an evolutionary computing framework, to create SimGANs that more accurately refine simulated data, and (3) create new feature-based quantitative metrics to evaluate refined data. We also use our framework to augment an electrocardiogram (ECG) dataset, a domain that suffers from the issues previously mentioned. In particular, while healthy ECGs can be simulated there are no current simulators of abnormal ECGs. We show that by using an evolved SimGAN to refine simulated healthy ECG data to mimic real-world abnormal ECGs, we can improve the accuracy of abnormal ECG classifiers.
    LeNSE: Learning To Navigate Subgraph Embeddings for Large-Scale Combinatorial Optimisation. (arXiv:2205.10106v1 [cs.LG])
    Combinatorial optimisation problems arise in several application domains and are often formulated in terms of graphs. Many of these problems are NP-hard, but exact solutions are not always needed. Several heuristics have been developed to provide near-optimal solutions; however, they typically do not scale well with the size of the graph. We propose a low-complexity approach for identifying a (possibly much smaller) subgraph of the original graph where the heuristics can be run in reasonable time and with a high likelihood of finding a global near-optimal solution. The core component of our approach is LeNSE, a reinforcement learning algorithm that learns how to navigate the space of possible subgraphs using a Euclidean subgraph embedding as its map. To solve CO problems, LeNSE is provided with a discriminative embedding trained with existing heuristics on only a small portion of the original graph. When tested on three problems (vertex cover, max-cut and influence maximisation) using real graphs with up to $10$ million edges, LeNSE identifies small subgraphs yielding solutions comparable to those found by running the heuristics on the entire graph, but at a fraction of the total run time.
    A Unified Approach to Synchronization Problems over Subgroups of the Orthogonal Group. (arXiv:2009.07514v2 [math.OC] UPDATED)
    The problem of synchronization over a group $\mathcal{G}$ aims to estimate a collection of group elements $G^*_1, \dots, G^*_n \in \mathcal{G}$ based on noisy observations of a subset of all pairwise ratios of the form $G^*_i {G^*_j}^{-1}$. Such a problem has gained much attention recently and finds many applications across a wide range of scientific and engineering areas. In this paper, we consider the class of synchronization problems in which the group is a closed subgroup of the orthogonal group. This class covers many group synchronization problems that arise in practice. Our contribution is fivefold. First, we propose a unified approach for solving this class of group synchronization problems, which consists of a suitable initialization step and an iterative refinement step based on the generalized power method, and show that it enjoys a strong theoretical guarantee on the estimation error under certain assumptions on the group, measurement graph, noise, and initialization. Second, we formulate two geometric conditions that are required by our approach and show that they hold for various practically relevant subgroups of the orthogonal group. The conditions are closely related to the error-bound geometry of the subgroup -- an important notion in optimization. Third, we verify the assumptions on the measurement graph and noise for standard random graph and random matrix models. Fourth, based on the classic notion of metric entropy, we develop and analyze a novel spectral-type estimator. Finally, we show via extensive numerical experiments that our proposed non-convex approach outperforms existing approaches in terms of computational speed, scalability, and/or estimation error.
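    A minimal sketch of the generalized power method for the noiseless orthogonal-group case (the paper's initialization, assumptions, and guarantees are far richer than this): each iteration multiplies by the measurement matrix and projects every d x d block back onto O(d) via an SVD.
```python
import numpy as np

def project_orthogonal(M):
    # Nearest orthogonal matrix in Frobenius norm: U V^T from the SVD of M.
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def generalized_power_method(C, n, d, iters=100):
    # C is the (nd x nd) matrix of noisy pairwise measurements C_ij ~ G_i G_j^T.
    # Spectral warm start, then the projected power iteration G <- Pi(C G),
    # projecting each d x d block back onto O(d).
    _, V = np.linalg.eigh(C)
    G = V[:, -d:]                           # top-d eigenvectors as a warm start
    G = np.vstack([project_orthogonal(G[i*d:(i+1)*d]) for i in range(n)])
    for _ in range(iters):
        CG = C @ G
        G = np.vstack([project_orthogonal(CG[i*d:(i+1)*d]) for i in range(n)])
    return G

# Noiseless sanity check: recovery up to a global right-multiplication in O(d).
rng = np.random.default_rng(1)
n, d = 8, 3
Gs = [project_orthogonal(rng.normal(size=(d, d))) for _ in range(n)]
Gstar = np.vstack(Gs)
C = Gstar @ Gstar.T                         # all pairwise blocks G_i G_j^T
G = generalized_power_method(C, n, d)
R = project_orthogonal(G[:d].T @ Gs[0])     # align using the first block
print(np.allclose(G @ R, Gstar, atol=1e-6))
```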
    Global and Individualized Community Detection in Inhomogeneous Multilayer Networks. (arXiv:2012.00933v3 [math.ST] UPDATED)
    In network applications, it has become increasingly common to obtain datasets in the form of multiple networks observed on the same set of subjects, where each network is obtained in a related but different experiment condition or application scenario. Such datasets can be modeled by multilayer networks where each layer is a separate network itself while different layers are associated and share some common information. The present paper studies community detection in a stylized yet informative inhomogeneous multilayer network model. In our model, layers are generated by different stochastic block models, the community structures of which are (random) perturbations of a common global structure while the connecting probabilities in different layers are not related. Focusing on the symmetric two block case, we establish minimax rates for both global estimation of the common structure and individualized estimation of layer-wise community structures. Both minimax rates have sharp exponents. In addition, we provide an efficient algorithm that is simultaneously asymptotic minimax optimal for both estimation tasks under mild conditions. The optimal rates depend on the parity of the number of most informative layers, a phenomenon that is caused by inhomogeneity across layers. The method is extended to handle multiple and potentially asymmetric community cases. We demonstrate its effectiveness on both simulated examples and a real multi-modal single-cell dataset.
    Learning List-wise Representation in Reinforcement Learning for Ads Allocation with Multiple Auxiliary Tasks. (arXiv:2204.00888v2 [cs.LG] UPDATED)
    With the recent prevalence of reinforcement learning (RL), there has been tremendous interest in utilizing RL for ads allocation in recommendation platforms (e.g., e-commerce and news feed sites). To achieve better allocation, the input of recent RL-based ads allocation methods has been upgraded from a point-wise single item to a list-wise item arrangement. However, this also results in a high-dimensional space of state-action pairs, making it difficult to learn list-wise representations with good generalization ability, which in turn hinders the exploration of RL agents and causes poor sample efficiency. To address this problem, we propose a novel RL-based approach for ads allocation which learns better list-wise representations by leveraging task-specific signals on the Meituan food delivery platform. Specifically, we propose three different auxiliary tasks, based on reconstruction, prediction, and contrastive learning respectively, according to prior domain knowledge on ads allocation. We conduct extensive experiments on the Meituan food delivery platform to evaluate the effectiveness of the proposed auxiliary tasks. Both offline and online experimental results show that the proposed method learns better list-wise representations and achieves higher revenue for the platform than state-of-the-art baselines.
    Kernel Normalized Convolutional Networks. (arXiv:2205.10089v1 [cs.LG])
    Existing deep convolutional neural network (CNN) architectures frequently rely upon batch normalization (BatchNorm) to effectively train the model. BatchNorm significantly improves model performance, but performs poorly with smaller batch sizes. To address this limitation, we propose kernel normalization and kernel normalized convolutional layers, and incorporate them into kernel normalized convolutional networks (KNConvNets) as the main building blocks. We implement KNConvNets corresponding to the state-of-the-art CNNs such as ResNet and DenseNet while forgoing BatchNorm layers. Through extensive experiments, we illustrate that KNConvNets consistently outperform their batch, group, and layer normalized counterparts in terms of both accuracy and convergence rate while maintaining competitive computational efficiency.
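    The abstract does not spell out the kernel normalization formula, so the following is a hypothetical reading, sketched for intuition only: normalize each kernel-sized input patch with its own mean and variance, so that no batch statistics are ever needed, and then apply the linear filter.
```python
import torch
import torch.nn.functional as F

def kernel_normalized_conv2d(x, weight, bias=None, eps=1e-5):
    # Hypothetical sketch (the paper's exact definition may differ):
    # standardize every kernel-sized input patch to zero mean / unit variance
    # before the linear filter, making the layer independent of batch size.
    k = weight.shape[-1]
    patches = F.unfold(x, kernel_size=k)              # (B, C*k*k, L)
    mu = patches.mean(dim=1, keepdim=True)
    var = patches.var(dim=1, keepdim=True, unbiased=False)
    patches = (patches - mu) / torch.sqrt(var + eps)
    out = weight.view(weight.shape[0], -1) @ patches  # (B, C_out, L)
    if bias is not None:
        out = out + bias.view(1, -1, 1)
    h = x.shape[-2] - k + 1
    w = x.shape[-1] - k + 1
    return out.view(x.shape[0], -1, h, w)

x = torch.randn(2, 3, 8, 8)   # statistics are per-patch, so the result
w = torch.randn(16, 3, 3, 3)  # does not depend on the batch size
print(kernel_normalized_conv2d(x, w).shape)  # torch.Size([2, 16, 6, 6])
```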
    Edge Rewiring Goes Neural: Boosting Network Resilience without Rich Features. (arXiv:2110.09035v2 [cs.LG] UPDATED)
    Improving the resilience of a network is a fundamental problem in network science, which protects the underlying system from natural disasters and malicious attacks. This is traditionally achieved via successive degree-preserving edge rewiring operations, with the major limitation of being transductive. Inductively solving graph-related tasks with sequential actions is accomplished by adopting graph neural networks (GNNs) coupled with reinforcement learning under the scenario with rich graph features. However, such frameworks cannot be directly applied to resilience tasks where only the pure topological structure is available. In this case, GNNs can barely learn useful information, making it prohibitively difficult to select actions for successively rewiring edges in a reinforcement learning context. In this paper, we study in depth the reasons why typical GNNs cause such failure. Based on this investigation, we propose \textbf{ResiNet}, the first end-to-end trainable inductive framework to discover \textbf{Resi}lient \textbf{Net}work topologies while balancing network utility. To this end, we reformulate resilience optimization as an MDP equipped with an edge rewiring action space, and propose a pure topology-oriented variant of GNN called \textbf{Fi}lt\textbf{r}ation \textbf{e}nhanced \textbf{G}raph \textbf{N}eural \textbf{N}etwork (\textbf{FireGNN}), which can learn from graphs without rich features. Extensive experiments demonstrate that ResiNet achieves a near-optimal resilience gain on various graphs while balancing the utility, and outperforms existing approaches by a large margin.
    Proposition-Level Clustering for Multi-Document Summarization. (arXiv:2112.08770v2 [cs.CL] UPDATED)
    Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means of coping with considerable information repetition. In particular, clusters were leveraged to indicate information saliency and to avoid redundancy. Such prior methods focused on clustering sentences, even though closely related sentences usually also contain non-aligned parts. In this work, we revisit the clustering approach, grouping together sub-sentential propositions and aiming at more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster via text fusion. Our summarization method improves over the previous state-of-the-art MDS method on the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and in human preference.
    Time Series Anomaly Detection via Reinforcement Learning-Based Model Selection. (arXiv:2205.09884v1 [cs.LG])
    Time series anomaly detection is of critical importance for the reliable and efficient operation of real-world systems. Many anomaly detection models have been developed over the years based on various assumptions regarding anomaly characteristics. However, due to the complex nature of real-world data, different anomalies within a time series usually have diverse profiles supporting different anomaly assumptions, making it difficult to find a single anomaly detector that can consistently beat all other models. In this work, to harness the benefits of different base models, we assume that a pool of anomaly detection models is accessible and propose to utilize reinforcement learning to dynamically select a candidate model from these base models. Experiments on real-world data demonstrate that the proposed strategy outperforms all baseline models in terms of overall performance.
    Seeking entropy: complex behavior from intrinsic motivation to occupy action-state path space. (arXiv:2205.10316v1 [cs.AI])
    Intrinsic motivation generates behaviors that do not necessarily lead to immediate reward, but help exploration and learning. Here we show that agents having the sole goal of maximizing occupancy of future actions and states, that is, moving and exploring over the long term, are capable of complex behavior without any reference to external rewards. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy to the optimal state-value function, from which we prove uniqueness of the solution of the associated Bellman equation and convergence of our algorithm to the optimal state-value function. Using discrete and continuous state tasks, we show that `dancing', hide-and-seek and a basic form of altruistic behavior naturally result from entropy seeking without external rewards. Intrinsically motivated agents can objectively determine what states constitute rewards, exploiting them to ultimately maximize action-state path entropy.
    Concurrent Policy Blending and System Identification for Generalized Assistive Control. (arXiv:2205.09836v1 [cs.RO])
    In this work, we address the problem of solving complex collaborative robotic tasks subject to multiple varying parameters. Our approach combines simultaneous policy blending with system identification to create generalized policies that are robust to changes in system parameters. We employ a blending network whose state space relies solely on parameter estimates from a system identification technique. As a result, this blending network learns how to handle parameter changes instead of trying to learn how to solve the task for a generalized parameter set simultaneously. We demonstrate our scheme's ability on a collaborative robot and human itching task in which the human has motor impairments. We then showcase our approach's efficiency with a variety of system identification techniques when compared to standard domain randomization.
    Converting Artificial Neural Networks to Spiking Neural Networks via Parameter Calibration. (arXiv:2205.10121v1 [cs.NE])
    Spiking Neural Networks (SNNs), originating from neural behavior in biology, have been recognized as one of the next-generation neural networks. Conventionally, SNNs can be obtained by converting pre-trained Artificial Neural Networks (ANNs), replacing the non-linear activation with spiking neurons without changing the parameters. In this work, we argue that simply copying and pasting the weights of an ANN to an SNN inevitably results in activation mismatch, especially for ANNs that are trained with batch normalization (BN) layers. To tackle the activation mismatch issue, we first provide a theoretical analysis by decomposing the local conversion error into clipping error and flooring error, and then quantitatively measure how this error propagates throughout the layers using a second-order analysis. Motivated by the theoretical results, we propose a set of layer-wise parameter calibration algorithms that adjust the parameters to minimize the activation mismatch. Extensive experiments on the proposed algorithms are performed on modern architectures and large-scale tasks, including ImageNet classification and MS COCO detection. We demonstrate that our method can handle SNN conversion with batch normalization layers and effectively preserve high accuracy even in 32 time steps. For example, our calibration algorithms can improve accuracy by up to 65% when converting VGG-16 with BN layers.
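    As background for what "copy the weights, swap in spiking neurons" means, here is a toy rate-coded conversion of a single ReLU layer to integrate-and-fire neurons with a simple data-based threshold; the residual printed at the end is exactly the kind of activation mismatch the paper's calibration algorithms go after (the calibration itself is not reproduced here).
```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 5))
x = rng.random(5)
relu = np.maximum(0.0, W @ x)     # the ANN activations we want to approximate

# Naive conversion: copy the weights, replace ReLU with integrate-and-fire
# neurons, and set the firing threshold to the max pre-activation seen on
# data (simple data-based threshold balancing).
T = 1000                          # simulation time steps
theta = (W @ x).max()             # threshold from the "calibration" data
v = np.zeros(10)                  # membrane potentials
spikes = np.zeros(10)
for _ in range(T):
    v += W @ x                    # constant input current each step
    fired = v >= theta
    spikes += fired
    v[fired] -= theta             # reset by subtraction

rates = spikes / T * theta        # decode firing rates back to activation scale
print(np.abs(rates - relu).max())  # small residual: the activation mismatch
```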
    Personalized Federated Learning with Adaptive Batchnorm for Healthcare. (arXiv:2112.00734v3 [cs.LG] UPDATED)
    There is a growing interest in applying machine learning techniques to healthcare. Recently, federated learning (FL) is gaining popularity since it allows researchers to train powerful models without compromising data privacy and security. However, the performance of existing FL approaches often deteriorates when encountering non-iid situations where there exist distribution gaps among clients, and few previous efforts focus on personalization in healthcare. In this article, we propose FedAP to tackle domain shifts and then obtain personalized models for local clients. FedAP learns the similarity between clients based on the statistics of the batch normalization layers while preserving the specificity of each client with different local batch normalization. Comprehensive experiments on five healthcare benchmarks demonstrate that FedAP achieves better accuracy compared to state-of-the-art methods (e.g., 10% accuracy improvement for PAMAP2) with faster convergence speed.
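    One plausible way to turn batch-normalization statistics into a client-similarity score (FedAP's precise formula may differ) is to treat a BN layer's per-channel mean and variance as a diagonal Gaussian and compare clients with the closed-form 2-Wasserstein distance:
```python
import numpy as np

def bn_similarity(stats_a, stats_b, eps=1e-8):
    # stats_* = (mean, var) vectors collected from one BN layer on each client.
    # Squared 2-Wasserstein distance between diagonal Gaussians, mapped to a
    # similarity in (0, 1]. Hypothetical scoring, not FedAP's exact metric.
    (mu_a, var_a), (mu_b, var_b) = stats_a, stats_b
    w2_sq = np.sum((mu_a - mu_b) ** 2) + np.sum((np.sqrt(var_a) - np.sqrt(var_b)) ** 2)
    return 1.0 / (1.0 + np.sqrt(w2_sq) + eps)

rng = np.random.default_rng(0)
mu = rng.normal(size=8)
var = rng.random(8) + 0.1
client_a = (mu, var)
client_b = (mu + 0.05, var)            # nearly identical data distribution
client_c = (mu + 2.0, var * 4.0)       # shifted domain
print(bn_similarity(client_a, client_b), bn_similarity(client_a, client_c))
# a similarity matrix like this can then weight cross-client aggregation
```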
    Lossless Acceleration for Seq2seq Generation with Aggressive Decoding. (arXiv:2205.10350v1 [cs.CL])
    We study lossless acceleration for seq2seq generation with a novel decoding algorithm -- Aggressive Decoding. Unlike previous efforts (e.g., non-autoregressive decoding) that speed up seq2seq generation at the cost of quality loss, our approach aims to yield generation identical to (or better than) autoregressive decoding with a significant speedup, achieved by innovative cooperation of aggressive decoding and verification, both of which are efficient due to parallel computing. We propose two Aggressive Decoding paradigms for two kinds of seq2seq tasks: 1) For seq2seq tasks whose inputs and outputs are highly similar (e.g., Grammatical Error Correction), we propose Input-guided Aggressive Decoding (IAD), which aggressively copies the input sentence as drafted decoded tokens to verify in parallel; 2) For other general seq2seq tasks (e.g., Machine Translation), we propose Generalized Aggressive Decoding (GAD), which first employs an additional non-autoregressive model for aggressive decoding and then verifies in parallel in the autoregressive manner. We test Aggressive Decoding with the popular 6-layer Transformer model on GPU in multiple seq2seq tasks: 1) For IAD, we show a 7x-9x speedup for the Transformer in Grammatical Error Correction and Text Simplification with results identical to greedy decoding; 2) For GAD, we observe a 3x-5x speedup with identical or even better quality in two important seq2seq tasks: Machine Translation and Abstractive Summarization. Moreover, Aggressive Decoding can benefit even more from stronger computing devices that are better at parallel computing. Given the lossless quality as well as the significant and promising speedup, we believe Aggressive Decoding may evolve into a de facto standard for efficient and lossless seq2seq generation in the near future.
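    A toy sketch of the Input-guided Aggressive Decoding loop, with a stand-in greedy model_step callable (hypothetical; a real implementation verifies the whole draft in a single parallel forward pass on GPU rather than looping token by token):
```python
def aggressive_decode(model_step, input_tokens, max_len=64):
    """Toy Input-guided Aggressive Decoding.

    model_step(prefix) -> next token under greedy decoding. Draft the
    (similar) input as the output, keep the longest prefix where draft and
    greedy prediction agree, append the first disagreeing model token, and
    re-draft from there.
    """
    output = []
    draft = list(input_tokens)
    while len(output) < max_len:
        # Verify the draft; in a real system this is ONE parallel forward pass.
        accepted = 0
        for tok in draft:
            if model_step(output + draft[:accepted]) != tok:
                break
            accepted += 1
        output += draft[:accepted]
        nxt = model_step(output)          # the first token the model disagrees on
        output.append(nxt)
        if nxt == "<eos>":
            break
        draft = list(input_tokens[len(output):])  # re-draft the remaining input
    return output

# Stand-in "model": always continues a fixed corrected sentence greedily.
target = "the cat sat on the mat <eos>".split()
def toy_model(prefix):
    return target[len(prefix)] if len(prefix) < len(target) else "<eos>"

src = "the cat sit on the mat".split()    # one "error" for the model to correct
print(" ".join(aggressive_decode(toy_model, src)))
```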
    Anomaly Detection for Multivariate Time Series on Large-scale Fluid Handling Plant Using Two-stage Autoencoder. (arXiv:2205.09924v1 [cs.LG])
    This paper focuses on anomaly detection for multivariate time series data in large-scale fluid handling plants with dynamic components, such as power generation, water treatment, and chemical plants, where signals from various physical phenomena are observed simultaneously. In these plants, the need for anomaly detection techniques is increasing in order to reduce the cost of operation and maintenance, in view of a decline in the number of skilled engineers and a shortage of manpower. However, the complex behavior of high-dimensional signals and the demand for interpretability make such techniques a major challenge. We introduce the Two-Stage AutoEncoder (TSAE), an anomaly detection method suitable for such plants. It is a simple autoencoder architecture that makes anomaly detection more interpretable and more accurate: based on the premise that plant signals can be separated into two nearly uncorrelated behaviors, the signals are decomposed into long-term and short-term components in a stepwise manner, and the two components are trained independently to improve the inference capability for normal signals. Through experiments on two publicly available datasets of water treatment systems, we confirm the high detection performance, the validity of the premise, and that the model behaves as intended, i.e., the technical effectiveness of TSAE.
    SADAM: Stochastic Adam, A Stochastic Operator for First-Order Gradient-based Optimizer. (arXiv:2205.10247v1 [cs.LG])
    In this work, to help escape stationary and saddle points efficiently, we propose, analyze, and generalize a stochastic strategy, performed as an operator for a first-order gradient descent algorithm, in order to increase the target accuracy and reduce time consumption. Unlike existing algorithms, the proposed stochastic strategy does not require any batches or sampling techniques, enabling efficient implementation and maintaining the initial first-order optimizer's convergence rate, while providing an incomparable improvement in target accuracy when optimizing the target functions. In short, the proposed strategy is generalized, applied to Adam, and validated via the decomposition of biomedical signals using Deep Matrix Fitting and four other peer optimizers. The validation results show that the proposed random strategy can be easily generalized to first-order optimizers and efficiently improves the target accuracy.
    Cross DQN: Cross Deep Q Network for Ads Allocation in Feed. (arXiv:2109.04353v4 [cs.LG] UPDATED)
    E-commerce platforms usually display a mixed list of ads and organic items in a feed. One key problem is to allocate the limited slots in the feed to maximize overall revenue and improve user experience, which requires a good model of user preference. Instead of modeling the influence of individual items on user behaviors, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most previous strategies fail to model such a signal and therefore result in suboptimal performance. In addition, the percentage of ads exposed (PAE) is an important indicator in ads allocation. Excessive PAE hurts user experience while too low a PAE reduces platform revenue. Therefore, constraining the PAE within a certain range while keeping personalized recommendation under the PAE constraint is a challenge. In this paper, we propose Cross Deep Q Network (Cross DQN) to extract the crucial arrangement signal by crossing the embeddings of different items and modeling the crossed sequence with multi-channel attention. Besides, we propose an auxiliary loss for a batch-level constraint on PAE to tackle the above-mentioned challenge. Our model achieves higher revenue and better user experience than state-of-the-art baselines in offline experiments. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on the Meituan feed, serving more than 300 million customers.
    Learning Interface Conditions in Domain Decomposition Solvers. (arXiv:2205.09833v1 [cs.LG])
    Domain decomposition methods are widely used and effective in the approximation of solutions to partial differential equations. Yet the optimal construction of these methods requires tedious analysis and is often available only in simplified, structured-grid settings, limiting their use for more complex problems. In this work, we generalize optimized Schwarz domain decomposition methods to unstructured-grid problems, using Graph Convolutional Neural Networks (GCNNs) and unsupervised learning to learn optimal modifications at subdomain interfaces. A key ingredient in our approach is an improved loss function, enabling effective training on relatively small problems, but robust performance on arbitrarily large problems, with computational cost linear in problem size. The performance of the learned linear solvers is compared with both classical and optimized domain decomposition algorithms, for both structured- and unstructured-grid problems.
    On Jointly Optimizing Partial Offloading and SFC Mapping: A Cooperative Dual-agent Deep Reinforcement Learning Approach. (arXiv:2205.09925v1 [cs.AI])
    Multi-access edge computing (MEC) and network function virtualization (NFV) are promising technologies to support emerging IoT applications, especially computation-intensive ones. In an NFV-enabled MEC environment, a service function chain (SFC), i.e., a set of ordered virtual network functions (VNFs), can be mapped onto MEC servers. Mobile devices (MDs) can offload computation-intensive applications, which can be represented by SFCs, fully or partially to MEC servers for remote execution. This paper studies the partial offloading and SFC mapping joint optimization (POSMJO) problem in an NFV-enabled MEC system, where an incoming task can be partitioned into two parts, one for local execution and the other for remote execution. The objective is to minimize the long-term average cost, a combination of execution delay, the MD's energy consumption, and the usage charge for edge computing. This problem consists of two closely related decision-making steps, namely task partition and VNF placement, which is highly complex and quite challenging. To address this, we propose a cooperative dual-agent deep reinforcement learning (CDADRL) algorithm, where we design a framework enabling interaction between the two agents. Simulation results show that the proposed algorithm outperforms three combinations of deep reinforcement learning algorithms in terms of cumulative and average episodic rewards, and it outperforms a number of baseline algorithms with respect to execution delay, energy consumption, and usage charge.
    Almost exact recovery in noisy semi-supervised learning. (arXiv:2007.14717v3 [cs.LG] UPDATED)
    Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v2 [cs.LG] UPDATED)
    Multivariate Time Series Forecasting focuses on the prediction of future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. This paper addresses these problems by translating multivariate forecasting into a spatiotemporal sequence formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning fully-connected spatiotemporal relationships purely from data.
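    The tokenization itself is easy to picture: a (T, N) multivariate series is flattened into T*N tokens, each tagged with its time index and variable index, so attention can relate any variable at any time to any other. A minimal numpy sketch of this flattening (the embedding layers that follow it are omitted):
```python
import numpy as np

# A (T, N) multivariate series becomes a sequence of T*N tokens, each
# carrying (value, time index, variable index).
T, N = 4, 3
series = np.arange(T * N, dtype=float).reshape(T, N)  # values y[t, n]

t_idx, n_idx = np.meshgrid(np.arange(T), np.arange(N), indexing="ij")
tokens = np.stack([series.ravel(), t_idx.ravel(), n_idx.ravel()], axis=1)
print(tokens.shape)  # (12, 3): one token per (time, variable) pair
print(tokens[:4])    # each row: value, time index, variable id
```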
    EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. (arXiv:2202.05146v3 [q-bio.BM] UPDATED)
    Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.
    Graph Representation Learning for Multi-Task Settings: a Meta-Learning Approach. (arXiv:2201.03326v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have become the state-of-the-art method for many applications on graph structured data. GNNs are a model for graph representation learning, which aims at learning to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. While this approach achieves great results in the single-task setting, the generation of node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is still an open problem. We propose the use of meta-learning to allow the training of a GNN model capable of producing multi-task node embeddings. In particular, we exploit the properties of optimization-based meta-learning to learn GNNs that can produce general node representations by learning parameters that can quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks. Our experiments show that the embeddings produced by a model trained with our purposely designed meta-learning procedure can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
    MCMARL: Parameterizing Value Function via Mixture of Categorical Distributions for Multi-Agent Reinforcement Learning. (arXiv:2202.10134v2 [cs.LG] UPDATED)
    In cooperative multi-agent tasks, a team of agents jointly interact with an environment by taking actions, receiving a team reward and observing the next state. During the interactions, the uncertainty of environment and reward will inevitably induce stochasticity in the long-term returns and the randomness can be exacerbated with the increasing number of agents. However, such randomness is ignored by most of the existing value-based multi-agent reinforcement learning (MARL) methods, which only model the expectation of Q-value for both individual agents and the team. Compared to using the expectations of the long-term returns, it is preferable to directly model the stochasticity by estimating the returns through distributions. With this motivation, this work proposes a novel value-based MARL framework from a distributional perspective, \emph{i.e.}, parameterizing value function via \underline{M}ixture of \underline{C}ategorical distributions for MARL. Specifically, we model both individual Q-values and global Q-value with categorical distribution. To integrate categorical distributions, we define five basic operations on the distribution, which allow the generalization of expected value function factorization methods (\emph{e.g.}, VDN and QMIX) to their MCMARL variants. We further prove that our MCMARL framework satisfies \emph{Distributional-Individual-Global-Max} (DIGM) principle with respect to the expectation of distribution, which guarantees the consistency between joint and individual greedy action selections in the global Q-value and individual Q-values. Empirically, we evaluate MCMARL on both a stochastic matrix game and a challenging set of StarCraft II micromanagement tasks, showing the efficacy of our framework.
    Revisiting GANs by Best-Response Constraint: Perspective, Methodology, and Application. (arXiv:2205.10146v1 [cs.LG])
    In past years, the minimax-type single-level optimization formulation and its variations have been widely utilized to address Generative Adversarial Networks (GANs). Unfortunately, it has been proved that these alternating learning strategies cannot exactly reveal the intrinsic relationship between the generator and discriminator, and thus easily result in a series of issues, including mode collapse, vanishing gradients, and oscillations in the training phase. In this work, by investigating the fundamental mechanism of GANs from the perspective of hierarchical optimization, we propose Best-Response Constraint (BRC), a general learning framework that can explicitly formulate the potential dependency of the generator on the discriminator. Rather than adopting existing time-consuming bilevel iterations, we design an implicit gradient scheme with outer-product Hessian approximation as our fast solution strategy. \emph{Noteworthy, we demonstrate that even with different motivations and formulations, a variety of existing GANs can ALL be uniformly improved by our flexible BRC methodology.} Extensive quantitative and qualitative experimental results verify the effectiveness, flexibility and stability of our proposed framework.
    Towards the Generation of Synthetic Images of Palm Vein Patterns: A Review. (arXiv:2205.10179v1 [cs.CV])
    With the recent success of computer vision and deep learning, remarkable progress has been achieved in automatic personal recognition using vein biometrics. However, collecting large-scale real-world training data for palm vein recognition has turned out to be challenging, mainly due to the noise and irregular variations introduced at acquisition time. Meanwhile, existing palm vein recognition datasets are usually collected under near-infrared light and lack detailed annotations on attributes (e.g., pose), so the influence of different attributes on vein recognition has been poorly investigated. This paper therefore examines the suitability of synthetic vein images for compensating for the urgent lack of publicly available large-scale datasets. First, we present an overview of recent research progress on palm vein recognition, from the basic background knowledge to vein anatomical structure, data acquisition, public databases, and quality assessment procedures. Then, we focus on the state-of-the-art methods that have allowed the generation of vascular structures for biometric purposes and the modeling of biological networks in their respective application domains. In addition, we review existing research on style-transfer and biology-based synthetic palm vein image generation algorithms. Afterward, we formalize a general flowchart for the creation of a synthetic database, comparing real palm vein images and generated synthetic samples to gain insight into the development of realistic vein imaging systems. Finally, we discuss the challenges, insights, and future perspectives in generating synthetic palm vein images for further work.
    A Computational Framework of Cortical Microcircuits Approximates Sign-concordant Random Backpropagation. (arXiv:2205.07292v2 [cs.NE] UPDATED)
    Several recent studies attempt to address the biological implausibility of the well-known backpropagation (BP) method. While promising methods such as feedback alignment, direct feedback alignment, and their variants like sign-concordant feedback alignment tackle BP's weight transport problem, their validity remains controversial owing to a set of other unsolved issues. In this work, we answer the question of whether it is possible to realize random backpropagation solely based on mechanisms observed in neuroscience. We propose a hypothetical framework consisting of a new microcircuit architecture and its supporting Hebbian learning rules. Comprising three types of cells and two types of synaptic connectivity, the proposed microcircuit architecture computes and propagates error signals through local feedback connections and supports the training of multi-layered spiking neural networks with a globally defined spiking error function. We employ the Hebbian rule operating in local compartments to update synaptic weights and achieve supervised learning in a biologically plausible manner. Finally, we interpret the proposed framework from an optimization point of view and show its equivalence to sign-concordant feedback alignment. The proposed framework is benchmarked on several datasets including MNIST and CIFAR10, demonstrating promising BP-comparable accuracy.
    Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining. (arXiv:2110.08412v2 [cs.CL] UPDATED)
    To explain NLP models, importance measures such as attention, which indicate which input tokens are important for a prediction, are popular. However, an open question is how accurately these explanations reflect a model's logic, a property called faithfulness. To answer this question, we propose a new faithfulness benchmark called Recursive ROAR. It works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens, yielding a performance curve as a function of the masking ratio. Furthermore, we propose a summarizing metric using the area between the curves, which allows for easy comparison across papers, models, and tasks. To provide a thorough evaluation, we assess 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both the computer vision and the faithfulness-of-attention literature.
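    A self-contained toy analogue of the benchmark on tabular data (hypothetical setup; the paper masks tokens in text models): recursively mask the currently most important feature, retrain from scratch, and compare the resulting accuracy curve against a random-masking baseline via the area between the curves.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:5] = 2.0                                        # 5 informative features
y = (X @ beta + rng.logistic(size=n) > 0).astype(int)

def accuracy(mask):
    # Retrain from scratch with masked features zeroed out -- the step that
    # separates ROAR-style benchmarks from simple occlusion tests.
    Xm = X * mask
    clf = LogisticRegression(max_iter=1000).fit(Xm[:1500], y[:1500])
    return clf.score(Xm[1500:], y[1500:])

def roar_curve(importance_fn, steps=10):
    mask = np.ones(d)
    curve = [accuracy(mask)]
    for _ in range(steps):
        clf = LogisticRegression(max_iter=1000).fit(X * mask, y)
        imp = importance_fn(clf) * mask               # re-rank remaining features
        mask[np.argmax(imp)] = 0.0                    # recursively mask the top one
        curve.append(accuracy(mask))
    return np.array(curve)

informed = roar_curve(lambda clf: np.abs(clf.coef_[0]))
randomed = roar_curve(lambda clf: rng.random(d))
print((randomed - informed).mean())  # area between curves: > 0 => faithful
```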
    AutoFedNLP: An efficient FedNLP framework. (arXiv:2205.10162v1 [cs.LG])
    Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify adapters, small bottleneck modules inserted at a variety of model layers, as the key building blocks. A key challenge is to properly configure the depth and width of adapters, to which training speed and efficiency are highly sensitive: no silver-bullet configuration exists, the optimal choice varies across downstream NLP tasks, desired model accuracy, and client resources, and a non-optimal configuration can significantly slow down training. To automate adapter configuration, we propose AutoFedNLP, a framework that enhances existing FedNLP with two novel designs. First, AutoFedNLP progressively upgrades the adapter configuration throughout a training session. Second, AutoFedNLP continuously profiles future adapter configurations by allocating participant devices to trial groups. To minimize client-side computation, AutoFedNLP exploits the fact that a FedNLP client trains on the same samples repeatedly between consecutive changes of adapter configurations, and caches computed activations on clients. Extensive experiments show that AutoFedNLP can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5$\times$ faster than vanilla FedNLP and 48$\times$ faster than strong baselines.
    Learning Convolutional Neural Networks in the Frequency Domain. (arXiv:2204.06718v8 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have achieved impressive success in computer vision during the past few decades. The image convolution operation helps CNNs achieve good performance on image-related tasks, but it has high computational complexity and is hard to implement efficiently. This paper proposes CEMNet, which can be trained in the frequency domain. The key motivation is that, by the Cross-Correlation Theorem, the image convolution can be replaced with a straightforward element-wise multiplication in the frequency domain, which substantially reduces the computational complexity. We further introduce a Weight Fixation mechanism to alleviate over-fitting, and analyze the behavior of Batch Normalization, Leaky ReLU, and Dropout in the frequency domain to design their counterparts for CEMNet. Also, to deal with the complex-valued inputs produced by the Discrete Fourier Transform, we design a two-branch network structure for CEMNet. Experimental results show that CEMNet achieves good performance on the MNIST and CIFAR-10 databases.
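    The identity being exploited is easy to verify numerically: by the Cross-Correlation Theorem, circular cross-correlation in the signal domain equals an element-wise product (with one factor conjugated) in the frequency domain. A 1D numpy check:
```python
import numpy as np

# Circular cross-correlation: direct[s] = sum_n k[n] * x[(n + s) mod N].
rng = np.random.default_rng(0)
N = 16
x = rng.normal(size=N)
k = rng.normal(size=N)

direct = np.array([np.sum(k * np.roll(x, -s)) for s in range(N)])

# Cross-Correlation Theorem: FFT(corr) = conj(FFT(k)) * FFT(x), so an
# O(N^2)-style sliding product becomes an O(N log N) element-wise product.
via_fft = np.fft.ifft(np.conj(np.fft.fft(k)) * np.fft.fft(x)).real
print(np.allclose(direct, via_fft))  # True
```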
    The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. (arXiv:2205.10144v1 [cs.CV])
    In laboratory object recognition tasks based on undistorted photographs, both adult humans and Deep Neural Networks (DNNs) perform close to ceiling. Unlike adults, whose object recognition performance is robust against a wide range of image distortions, DNNs trained on standard ImageNet (1.3M images) perform poorly on distorted images. However, the last two years have seen impressive gains in DNN distortion robustness, predominantly achieved through ever-larger datasets, orders of magnitude larger than ImageNet. While this simple brute-force approach is very effective in achieving human-level robustness in DNNs, it raises the question of whether human robustness, too, is simply due to extensive experience with (distorted) visual input during childhood and beyond. Here we investigate this question by comparing the core object recognition performance of 146 children (aged 4-15) against adults and against DNNs. We find, first, that already 4-6 year-olds show remarkable robustness to image distortions and outperform DNNs trained on ImageNet. Second, we estimated the number of "images" children have been exposed to during their lifetime; compared to various DNNs, children's high robustness requires relatively little data. Third, when recognizing objects, children (like adults but unlike DNNs) rely heavily on shape rather than texture cues. Together our results suggest that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely the result of a mere accumulation of experience with distorted visual input. Even though current DNNs match human performance regarding robustness, they seem to rely on different and more data-hungry strategies to do so.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v3 [cs.LG] UPDATED)
    A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v2 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    Task Relabelling for Multi-task Transfer using Successor Features. (arXiv:2205.10175v1 [cs.AI])
    Deep Reinforcement Learning has recently been very successful across a variety of complex domains. Most works are concerned with learning a single policy that solves the target task, but the policy is fixed in the sense that if the environment changes the agent is unable to adapt. Successor Features (SFs) provide a mechanism for learning policies that are not tied to any particular reward function. In this work we investigate how SFs may be pre-trained without observing any reward in a custom environment featuring resource collection, traps and crafting. After pre-training we expose the SF agents to various target tasks and measure how well they transfer to new tasks. Transfer is done without any further training of the SF agents, just by providing a task vector. For training the SFs we propose a task relabelling method which greatly improves the agent's performance.
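    For readers unfamiliar with the mechanism: when the reward factors as r(s, a) = phi(s, a) . w, action-values factor as Q_pi(s, a) = psi_pi(s, a) . w, where psi are the successor features of pi, so reward-free pre-trained SFs transfer to a new task simply by plugging in its task vector w. A toy sketch with generalized policy improvement over several pre-trained policies (all numbers random, purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
n_policies, n_states, n_actions, n_features = 3, 5, 2, 4

# Pretend these SFs were pre-trained without observing any reward.
psi = rng.normal(size=(n_policies, n_states, n_actions, n_features))

def gpi_action(state, w):
    # Generalized Policy Improvement: evaluate every pre-trained policy's SFs
    # on the new task vector and act greedily over all of them at once.
    q = psi[:, state] @ w                   # (n_policies, n_actions)
    return int(np.unravel_index(np.argmax(q), q.shape)[1])

# Hypothetical task vectors for an environment with resources and traps.
w_wood = np.array([1.0, 0.0, 0.0, -0.5])    # e.g. "collect wood, avoid traps"
w_gold = np.array([0.0, 1.0, 0.0, -0.5])
print(gpi_action(0, w_wood), gpi_action(0, w_gold))  # no retraining needed
```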
    The Fairness of Credit Scoring Models. (arXiv:2205.10200v1 [stat.ML])
    In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g. gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call lack of fairness and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
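    A minimal sketch of step (1) in the spirit of the abstract (the paper's actual test statistics may be more elaborate): a two-proportion z-test comparing approval rates between the protected group and the rest of the population.
```python
import numpy as np
from scipy import stats

def parity_test(approved_protected, n_protected, approved_rest, n_rest):
    # Two-proportion z-test for a difference in approval rates between the
    # protected group and the rest of the population (statistical parity).
    p1 = approved_protected / n_protected
    p2 = approved_rest / n_rest
    p = (approved_protected + approved_rest) / (n_protected + n_rest)
    se = np.sqrt(p * (1 - p) * (1 / n_protected + 1 / n_rest))
    z = (p1 - p2) / se
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical counts: 40% approval for the protected group vs 48% otherwise.
z, pval = parity_test(approved_protected=400, n_protected=1000,
                      approved_rest=480, n_rest=1000)
print(f"z = {z:.2f}, p-value = {pval:.4f}")  # small p-value => lack of fairness
```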
    Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze?. (arXiv:2205.10226v1 [cs.CL])
    Learned self-attention functions in state-of-the-art NLP models often correlate with human attention. We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention. We compare attention functions across two task-specific reading datasets for sentiment analysis and relation extraction. We find the predictiveness of large-scale pre-trained self-attention for human attention depends on `what is in the tail', e.g., the syntactic nature of rare contexts. Further, we observe that task-specific fine-tuning does not increase the correlation with human task-specific reading. Through an input reduction experiment we give complementary insights on the sparsity and fidelity trade-off, showing that lower-entropy attention vectors are more faithful.
    A Novel Weighted Ensemble Learning Based Agent for the Werewolf Game. (arXiv:2205.09813v1 [cs.LG])
    Werewolf is a popular party game throughout the world, and research on it has progressed in recent years. The Werewolf game is based on conversation, and in order to win, participants must use all of their cognitive abilities; playing agents must be very sophisticated to do well. In this research, we generate a sophisticated agent for the Werewolf game using a complex weighted ensemble learning approach, aiming to estimate what other agents/players think of our agent during the game. The agent was developed by aggregating the strategies of different participants in the AI Wolf competition and learning from them using machine learning. Moreover, the resulting agent performs much better than competitors that use very basic strategies, demonstrating the approach's effectiveness in the Werewolf game. The machine learning technique used here is not restricted to the Werewolf game and may be extended to any game that requires communication and action depending on other participants.
    Linearizing Transformer with Key-Value Memory. (arXiv:2203.12644v3 [cs.CL] UPDATED)
    Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based transformers. Despite their unique merits, they usually suffer a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain computational gains when the generation is short. We propose MemSizer, an approach that closes the performance gap while improving efficiency even with short generation. It projects the source sequence into lower-dimensional representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism which renders the computation as light as a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks: machine translation, abstractive text summarization, and language modeling.
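    For context, here is the generic kernel-based linear-attention recurrence that this family of models relies on for constant-memory incremental decoding (MemSizer's exact parameterization differs): a (d x d) key-value memory is updated once per step and queried with the current token.
```python
import numpy as np

def phi(x):
    # A simple positive feature map for kernel-based linear attention.
    return np.maximum(x, 0.0) + 1e-6

d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))               # key-value memory
z = np.zeros(d)                    # running normalizer
outputs = []
for _ in range(5):                 # incremental (decoding-time) computation
    q, k, v = rng.normal(size=(3, d))
    S += np.outer(phi(k), v)       # write: accumulate key-value associations
    z += phi(k)
    outputs.append(phi(q) @ S / (phi(q) @ z))   # read: normalized retrieval
print(np.array(outputs).shape)     # (5, 8); per-step cost and memory are O(d^2),
                                   # constant in the sequence length
```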
    GNN-Geo: A Graph Neural Network-based Fine-grained IP geolocation Framework. (arXiv:2112.10767v6 [cs.LG] UPDATED)
    Rule-based fine-grained IP geolocation methods are hard to generalize to computer networks that do not follow the hypothesized rules. Recently, deep learning methods, such as the multi-layer perceptron (MLP), have been tried to increase generalization capability. However, MLP is not well suited to graph-structured data like networks: it treats IP addresses as isolated instances and ignores connection information, which limits geolocation accuracy. In this work, we investigate how to increase generalization capability with an emerging graph deep learning method, the Graph Neural Network (GNN). First, IP geolocation is re-formulated as an attributed-graph node regression problem. Then, we propose a GNN-based IP geolocation framework named GNN-Geo, which consists of a preprocessor, an encoder, message passing (MP) layers, and a decoder. The preprocessor and encoder transform measurement data into initial node embeddings. The MP layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to node locations and relieves the convergence problem of the GNN by incorporating prior knowledge. Experiments on different real-world datasets show that the proposed GNN-Geo outperforms state-of-the-art rule-based and learning-based baselines on all datasets, reducing median error distance by 16% to 28%. This work verifies the great potential of GNNs for fine-grained IP geolocation.
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v3 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information on consumers, effective personalization of offers, goods, and services has become a core business focus for companies seeking to improve revenues and maintain a competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with an unknown data collection procedure. Importantly, in many business and medical settings, interpretability of a policy is essential. To address these challenges, we study the class of policies with linear decision boundaries and propose learning algorithms using tools from causal inference. We propose several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that an adapted Bayesian optimization algorithm is fast and effective. We test our algorithm with extensive simulation studies and apply it to an online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2\% in net sales revenue, while also providing informative insights on which features are important for the decision-making process, e.g., when "Attribute 2" is large, the marginal increase in GMV is low for discounts higher than 10\%. Our findings suggest that the proposed policy learning algorithm provides a promising practical approach for interpretable personalization across a wide range of applications.
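    The evaluation step at the heart of such offline policy learning can be sketched with a standard inverse-propensity-weighted (IPW) value estimate of a linear-boundary policy. The binary-action setup, known propensities, and synthetic rewards below are illustrative assumptions rather than the paper's exact estimator.
    ```python
    import numpy as np

    def linear_policy(theta, X):
        # Interpretable policy: recommend the action (e.g., a discount) iff theta^T x >= 0.
        return (X @ theta >= 0).astype(int)

    def ipw_value(theta, X, actions, rewards, propensities):
        """Inverse-propensity-weighted estimate of the reward a linear policy
        would have collected on logged data (actions in {0, 1})."""
        chosen = linear_policy(theta, X)
        match = (chosen == actions).astype(float)
        return np.mean(match * rewards / propensities)

    n, d = 1000, 3
    X = np.random.randn(n, d)
    propensities = np.full(n, 0.5)                  # logging policy: uniform random
    actions = (np.random.rand(n) < propensities).astype(int)
    rewards = X[:, 0] * actions + np.random.randn(n) * 0.1
    print(ipw_value(np.array([1.0, 0.0, 0.0]), X, actions, rewards, propensities))
    ```
    Optimizing this non-convex, non-smooth objective over theta is where the paper's adapted Bayesian optimization scheme comes in.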
    Counterfactual Temporal Point Processes. (arXiv:2111.07603v2 [cs.LG] UPDATED)
    Machine learning models based on temporal point processes are the state of the art in a wide variety of applications involving discrete events in continuous time. However, these models lack the ability to answer counterfactual questions, which are increasingly relevant as these models are being used to inform targeted interventions. In this work, our goal is to fill this gap. To this end, we first develop a causal model of thinning for temporal point processes that builds upon the Gumbel-Max structural causal model. This model satisfies a desirable counterfactual monotonicity condition, which is sufficient to identify counterfactual dynamics in the process of thinning. Then, given an observed realization of a temporal point process with a given intensity function, we develop a sampling algorithm that uses the above causal model of thinning and the superposition theorem to simulate counterfactual realizations of the temporal point process under a given alternative intensity function. Simulation experiments using synthetic and real epidemiological data show that the counterfactual realizations provided by our algorithm may give valuable insights to enhance targeted interventions.
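    The coupling idea can be sketched with classical Lewis thinning: events of a dominating Poisson process are accepted with probability lambda(t)/lambda_max, and a counterfactual realization replays the same exogenous acceptance noise under an alternative intensity. This shared-uniform coupling is a simplified, monotone stand-in for the paper's Gumbel-Max structural causal model, and the sketch omits the superposition step used to handle events absent from the observed realization.
    ```python
    import numpy as np

    def sample_poisson_times(lam_max, horizon, rng):
        # Homogeneous Poisson process on [0, horizon] with rate lam_max.
        n = rng.poisson(lam_max * horizon)
        return np.sort(rng.uniform(0.0, horizon, size=n))

    def thin_with_noise(times, noise, intensity, lam_max):
        # Lewis thinning: keep event t_i iff u_i < lambda(t_i) / lam_max.
        keep = noise < np.array([intensity(t) for t in times]) / lam_max
        return times[keep]

    rng = np.random.default_rng(0)
    lam_max, horizon = 5.0, 10.0
    times = sample_poisson_times(lam_max, horizon, rng)
    noise = rng.uniform(size=times.shape)           # shared exogenous noise

    observed = thin_with_noise(times, noise, lambda t: 2.0 + np.sin(t), lam_max)
    # Counterfactual: replay the *same* noise under an alternative intensity.
    counterfactual = thin_with_noise(times, noise, lambda t: 1.0 + np.sin(t), lam_max)
    ```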
    On Algorithmic Stability in Unsupervised Representation Learning. (arXiv:2106.05238v3 [cs.LG] UPDATED)
    In this paper, we investigate the algorithmic stability of unsupervised representation learning with deep generative models, as a function of repeated re-training on the same input data. Algorithms for learning low-dimensional linear representations -- for example, principal components analysis (PCA) or linear independent components analysis (ICA) -- come with guarantees that they will always reveal the same latent representations (perhaps up to an arbitrary rotation or permutation). Unfortunately, for non-linear representation learning, such as in a variational auto-encoder (VAE) model trained by stochastic gradient descent, we have no such guarantees. Recent work on identifiability in non-linear ICA has introduced a family of deep generative models with identifiable latent representations, achieved by conditioning on side information (e.g., informative labels). We empirically evaluate the stability of these models under repeated re-estimation of parameters, and compare them to both standard VAEs and deep generative models which learn to cluster in their latent space. Surprisingly, we discover that side information is not necessary for algorithmic stability: using standard quantitative measures of identifiability, we find that deep generative models with latent clusterings are empirically identifiable to the same degree as models which rely on auxiliary labels. We relate these results to the possibility of identifiable non-linear ICA.
    CertiFair: A Framework for Certified Global Fairness of Neural Networks. (arXiv:2205.09927v1 [cs.LG])
    We consider the problem of whether a Neural Network (NN) model satisfies global individual fairness. Individual fairness suggests that similar individuals with respect to a certain task are to be treated similarly by the decision model. In this work, we have two main objectives. The first is to construct a verifier which checks whether the fairness property holds for a given NN in a classification task, or provides a counterexample if it is violated, i.e., the model is fair if all similar individuals are classified the same, and unfair if a pair of similar individuals are classified differently. To that end, we construct a sound and complete verifier that verifies global individual fairness properties of ReLU NN classifiers using distance-based similarity metrics. The second objective of this paper is to provide a method for training provably fair NN classifiers from unfair (biased) data. We propose a fairness loss that can be used during training to enforce fair outcomes for similar individuals. We then provide provable bounds on the fairness of the resulting NN. We run experiments on commonly used, publicly available fairness datasets and show that global individual fairness can be improved by 96% without a significant drop in test accuracy.
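    One plausible instantiation of such a training-time fairness regularizer is sketched below: penalize output gaps between each individual and a counterpart that differs only in a sensitive attribute. The similarity notion (a flipped binary attribute), the tolerance epsilon, and the penalty weight are illustrative assumptions; the paper's loss and its provable bounds are defined with respect to its distance-based similarity metrics.
    ```python
    import torch

    def individual_fairness_loss(model, x, sensitive_idx, epsilon=0.1):
        """Penalize output differences between each input and a 'similar
        individual' that differs only in the sensitive attribute."""
        x_similar = x.clone()
        x_similar[:, sensitive_idx] = 1.0 - x_similar[:, sensitive_idx]  # flip binary attribute
        gap = (model(x) - model(x_similar)).abs()
        return torch.clamp(gap - epsilon, min=0.0).mean()

    model = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
    x = torch.rand(32, 5)
    task_loss = model(x).pow(2).mean()              # placeholder task loss
    loss = task_loss + 10.0 * individual_fairness_loss(model, x, sensitive_idx=0)
    loss.backward()
    ```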
    Unintended memorisation of unique features in neural networks. (arXiv:2205.10079v1 [cs.LG])
    Neural networks pose a privacy risk due to their propensity to memorise and leak training data. We show that unique features occurring only once in training data are memorised by discriminative multi-layer perceptrons and convolutional neural networks trained on benchmark imaging datasets. We design our method for settings where sensitive training data is not available, for example medical imaging. Our setting knows the unique feature, but not the training data, the model weights or the unique feature's label. We develop a score estimating a model's sensitivity to a unique feature by comparing the KL divergences of the model's output distributions given modified out-of-distribution images. We find that typical strategies to prevent overfitting do not prevent unique feature memorisation, and that images containing a unique feature are highly influential regardless of the influence of the image's other features. We also find significant variation in memorisation with training seed. These results imply that neural networks pose a privacy risk to rarely occurring private information. This risk is more pronounced in healthcare applications, since sensitive patient information can be memorised when it remains in training data due to an imperfect data sanitisation process.
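    A minimal sketch of such a sensitivity score: compare the model's output distributions on out-of-distribution probe images with and without the unique feature inserted, via KL divergence. The probe images, the stamped-patch "unique feature", and the untrained model below are all illustrative assumptions.
    ```python
    import torch
    import torch.nn.functional as F

    def unique_feature_sensitivity(model, ood_images, insert_feature):
        """Score a model's sensitivity to a unique feature: KL divergence
        between output distributions on OOD images with and without the
        feature. Higher scores suggest the feature was memorised."""
        with torch.no_grad():
            p = F.log_softmax(model(ood_images), dim=1)
            q = F.log_softmax(model(insert_feature(ood_images)), dim=1)
        # KL(p || q), averaged over the probe images.
        return F.kl_div(q, p, log_target=True, reduction="batchmean")

    def stamp_patch(images):
        # Hypothetical 'unique feature': a small white square in the corner.
        images = images.clone()
        images[:, :, :4, :4] = 1.0
        return images

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    probes = torch.rand(64, 1, 28, 28)              # stand-in OOD probe images
    print(unique_feature_sensitivity(model, probes, stamp_patch))
    ```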
    Unsupervised Out-of-Domain Detection via Pre-trained Transformers. (arXiv:2106.00948v2 [cs.CL] UPDATED)
    Deployed real-world machine learning applications are often subject to uncontrolled and even potentially malicious inputs. Those out-of-domain inputs can lead to unpredictable outputs and sometimes catastrophic safety issues. Prior studies on out-of-domain detection require in-domain task labels and are limited to supervised classification scenarios. Our work tackles the problem of detecting out-of-domain samples with only unsupervised in-domain data. We utilize the latent representations of pre-trained transformers and propose a simple yet effective method to transform features across all layers to construct out-of-domain detectors efficiently. Two domain-specific fine-tuning approaches are further proposed to boost detection accuracy. Our empirical evaluations of related methods on two datasets validate that our method greatly improves out-of-domain detection ability in a more general scenario.
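    To make the general recipe concrete, the sketch below scores out-of-domain inputs by Mahalanobis distance under a Gaussian fitted to in-domain features, one common unsupervised choice. The paper's actual feature transformation across layers differs, and the random features here stand in for real transformer embeddings.
    ```python
    import numpy as np

    class MahalanobisDetector:
        """Unsupervised OOD detector over (pre-trained transformer) features:
        fit a Gaussian to in-domain features, score inputs by Mahalanobis
        distance. One detector per layer can be combined by summing scores."""

        def fit(self, feats):                      # feats: (n, d) in-domain features
            self.mean = feats.mean(axis=0)
            cov = np.cov(feats, rowvar=False) + 1e-4 * np.eye(feats.shape[1])
            self.prec = np.linalg.inv(cov)
            return self

        def score(self, feats):                    # higher = more out-of-domain
            z = feats - self.mean
            return np.einsum("nd,de,ne->n", z, self.prec, z)

    in_domain = np.random.randn(500, 32)           # stand-in for CLS embeddings
    out_domain = np.random.randn(20, 32) + 3.0
    det = MahalanobisDetector().fit(in_domain)
    print(det.score(out_domain).mean(), det.score(in_domain).mean())
    ```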
    You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas. (arXiv:2205.10228v1 [cs.CL])
    Social chatbots, also known as chit-chat chatbots, evolve rapidly with large pretrained language models. Despite the huge progress, privacy concerns have arisen recently: training data of large language models can be extracted via model inversion attacks. On the other hand, the datasets used for training chatbots contain many private conversations between two individuals. In this work, we further investigate the privacy leakage of the hidden states of chatbots trained by language modeling which has not been well studied yet. We show that speakers' personas can be inferred through a simple neural network with high accuracy. To this end, we propose effective defense objectives to protect persona leakage from hidden states. We conduct extensive experiments to demonstrate that our proposed defense objectives can greatly reduce the attack accuracy from 37.6% to 0.5%. Meanwhile, the proposed objectives preserve language models' powerful generation ability.
    Policies for the Dynamic Traveling Maintainer Problem with Alerts. (arXiv:2105.15119v2 [math.OC] UPDATED)
    Downtime of industrial assets such as wind turbines and medical imaging devices comes at a sharp cost. To avoid such downtime costs, companies seek to initiate maintenance just before failure. Unfortunately, this is challenging for two reasons: asset failures are notoriously difficult to predict, even in the presence of real-time monitoring devices that signal early degradation, and the resources available to serve a network of geographically dispersed assets are typically limited. In this paper, we propose a novel dynamic traveling maintainer problem with alerts model that incorporates these two challenges, and we provide three solution approaches for dispatching the limited resources. Namely, we propose: (i) greedy heuristic approaches that rank assets on urgency, proximity and economic risk; (ii) a novel traveling maintainer heuristic approach that optimizes short-term costs; and (iii) a deep reinforcement learning (DRL) approach that optimizes long-term costs. Each approach has different requirements concerning the available alert information. Experiments with small asset networks show that all methods can approximate the optimal policy when given access to complete condition information. For larger networks, the proposed methods yield competitive policies, with DRL consistently achieving the lowest costs.
    A Unified Experiment Design Approach for Cyclic and Acyclic Causal Models. (arXiv:2205.10083v1 [cs.LG])
    We study experiment design for the unique identification of the causal graph of a system where the graph may contain cycles. The presence of cycles in the structure introduces major challenges for experiment design. Unlike the case of acyclic graphs, learning the skeleton of the causal graph from the observational distribution may not be possible. Furthermore, intervening on a variable does not necessarily lead to orienting all the edges incident to it. In this paper, we propose an experiment design approach that can learn both cyclic and acyclic graphs and hence unifies the task of experiment design for both types of graphs. We provide a lower bound on the number of experiments required to guarantee the unique identification of the causal graph in the worst case, showing that the proposed approach is order-optimal in terms of the number of experiments up to an additive logarithmic term. Moreover, we extend our result to the setting where the size of each experiment is bounded by a constant. For this case, we show that our approach is optimal in terms of the size of the largest experiment required for the unique identification of the causal graph in the worst case.
    Confident Clustering via PCA Compression Ratio and Its Application to Single-cell RNA-seq Analysis. (arXiv:2205.09849v1 [cs.LG])
    Unsupervised clustering algorithms for vectors have been widely used in machine learning. Many applications, including the biological data we study in this paper, contain boundary datapoints that show combined properties of two underlying clusters and can lower the performance of traditional clustering algorithms. We develop a confident clustering method that aims to diminish the influence of these datapoints and improve the clustering results. Concretely, for a list of datapoints, we produce two clustering results. The first-round clustering attempts to classify only pure vectors with high confidence. Based on it, we classify more vectors with less confidence in the second round. We validate our algorithm on single-cell RNA-seq data, a powerful and widely used tool in biology. Our confident clustering shows high accuracy on the tested datasets. In addition, unlike traditional clustering methods in single-cell analysis, confident clustering shows high stability under different choices of parameters.
    How to Guide Adaptive Depth Sampling?. (arXiv:2205.10202v1 [cs.CV])
    Recent advances in depth sensing technologies allow fast electronic maneuvering of the laser beam, as opposed to fixed mechanical rotations. This will, in principle, enable future sensors to vary the sampling pattern in real time. We examine here the abstract problem of whether adapting the sampling pattern for a given frame can reduce the reconstruction error or allow a sparser pattern. We propose a constructive generic method to guide adaptive depth sampling algorithms. Given a sampling budget B, a depth predictor P and a desired quality measure M, we propose an Importance Map that highlights important sampling locations. This map is defined for a given frame as the per-pixel expected value of M produced by the predictor P, given a pattern of B random samples. This map can be well estimated in a training phase. We show that a neural network can learn to produce a highly faithful Importance Map, given an RGB image. We then suggest an algorithm to produce a sampling pattern for the scene, which is denser in regions that are harder to reconstruct. The sampling strategy of our modular framework can be adjusted according to hardware limitations, the type of depth predictor, and any custom reconstruction error measure that should be minimized. We validate through simulations that our approach outperforms grid and random sampling patterns as well as recent state-of-the-art adaptive algorithms.
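    A toy Monte Carlo version of the Importance Map construction, assuming ground-truth depth is available (as in the training phase): repeatedly draw random B-sample patterns, reconstruct with a stand-in predictor, and average the per-pixel error. The nearest-neighbor predictor and the absolute-error quality measure are illustrative assumptions for P and M.
    ```python
    import numpy as np

    def nearest_neighbor_predict(depth, mask):
        # Minimal stand-in for the depth predictor P: copy the nearest sampled pixel.
        ys, xs = np.nonzero(mask)
        gy, gx = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
        d2 = (gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2
        return depth[ys, xs][d2.argmin(axis=-1)]

    def importance_map(depth, budget, trials, rng):
        """Monte Carlo estimate of the per-pixel expected reconstruction error
        under random B-sample patterns; high values mark pixels worth sampling."""
        err = np.zeros_like(depth)
        for _ in range(trials):
            mask = np.zeros(depth.shape, dtype=bool)
            mask.flat[rng.choice(depth.size, size=budget, replace=False)] = True
            err += np.abs(nearest_neighbor_predict(depth, mask) - depth)
        return err / trials

    rng = np.random.default_rng(0)
    depth = np.repeat(np.linspace(1.0, 5.0, 32)[None, :], 32, axis=0)  # toy scene
    imp = importance_map(depth, budget=20, trials=25, rng=rng)
    ```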
    Generalisation effects of predictive uncertainty estimation in deep learning for digital pathology. (arXiv:2112.09693v2 [cs.LG] UPDATED)
    Deep learning (DL) has shown great potential in digital pathology applications. The robustness of a diagnostic DL-based solution is essential for safe clinical deployment. In this work we evaluate whether adding uncertainty estimates to DL predictions in digital pathology could result in increased value for the clinical applications, by boosting the general predictive performance or by detecting mispredictions. We compare the effectiveness of model-integrated methods (MC dropout and deep ensembles) with a model-agnostic approach (test-time augmentation, TTA). Moreover, four uncertainty metrics are compared. Our experiments focus on two domain shift scenarios: a shift to a different medical center and a shift to an underrepresented subtype of cancer. Our results show that uncertainty estimates increase reliability by reducing a model's sensitivity to classification threshold selection, as well as by detecting between 70\% and 90\% of the mispredictions made by the model. Overall, the deep ensembles method achieved the best performance, closely followed by TTA.
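    The model-agnostic TTA approach can be sketched in a few lines: average softmax outputs over augmented copies of the input and take predictive entropy as the uncertainty score. The toy model, the three augmentations, and the entropy metric are illustrative assumptions; the paper compares four uncertainty metrics.
    ```python
    import torch

    def tta_predict(model, image, augmentations):
        """Test-time augmentation: average softmax outputs over augmented
        copies and use predictive entropy as an uncertainty estimate."""
        with torch.no_grad():
            probs = torch.stack([model(aug(image)).softmax(dim=1) for aug in augmentations])
        mean_probs = probs.mean(dim=0)
        entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)
        return mean_probs, entropy

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
    augmentations = [
        lambda x: x,                               # identity
        lambda x: torch.flip(x, dims=[-1]),        # horizontal flip
        lambda x: torch.flip(x, dims=[-2]),        # vertical flip
    ]
    image = torch.rand(1, 3, 64, 64)
    probs, uncertainty = tta_predict(model, image, augmentations)
    ```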
    Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning. (arXiv:2203.13319v2 [cs.LG] UPDATED)
    We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). ReF-ER was shown to outperform state-of-the-art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator, and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and we ignore the effects of other actions on the transition map. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state-of-the-art algorithms that rely on complex neural network architectures.
    Improving Multi-Task Generalization via Regularizing Spurious Correlation. (arXiv:2205.09797v1 [cs.LG])
    Multi-Task Learning (MTL) is a powerful learning paradigm to improve generalization performance via knowledge sharing. However, existing studies find that MTL could sometimes hurt generalization, especially when two tasks are less correlated. One possible reason is spurious correlation, i.e., some knowledge is spurious and not causally related to task labels, but the model could mistakenly utilize it and thus fail when such correlations change. In the MTL setup, there exist several unique challenges of spurious correlation. First, the risk of having non-causal knowledge is higher, as the shared MTL model needs to encode all knowledge from different tasks, and causal knowledge for one task could be potentially spurious to another. Second, the confounder between task labels brings in a different type of spurious correlation to MTL. We theoretically prove that MTL is more prone to taking on non-causal knowledge from other tasks than single-task learning, and thus generalizes worse. To solve this problem, we propose the Multi-Task Causal Representation Learning framework, aiming to represent multi-task knowledge via disentangled neural modules, and to learn which module is causally related to each task via MTL-specific invariant regularization. Experiments show that it can enhance the MTL model's performance by 5.5% on average over Multi-MNIST, MovieLens, Taskonomy, CityScape, and NYUv2, by alleviating the spurious correlation problem.
    Continual learning on 3D point clouds with random compressed rehearsal. (arXiv:2205.08013v2 [cs.LG] UPDATED)
    Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are an important datatype for precise modeling of three-dimensional environments, but effective processing of this type of data proves to be challenging. In the world of large, heavily-parameterized network architectures and continuously-streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. However, in 3D computer vision there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We utilize point cloud structure properties to preserve a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods for the learning process, our approach achieves a significant decrease in catastrophic forgetting compared to existing solutions on several of the most popular point cloud datasets, considering two continual learning settings: when a task is known beforehand, and the challenging scenario in which task information is unknown to the model.
    Translating Hanja historical documents to understandable Korean and English. (arXiv:2205.10019v1 [cs.CL])
    The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, `Hanja', and translated into Korean from 1968 to 1993. However, this translation was literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012, completing the records of only one king in a decade. Experts are also working on an English translation, of which only one king's records are available because of the high cost and slow progress. Thus, we propose H2KE, a neural machine translation model that translates Hanja historical documents into understandable Korean and English. Based on the multilingual neural machine translation approach, it translates the historical documents written in Hanja using both the full dataset of outdated Korean translations and a small dataset of recently translated Korean and English. We compare our method with two baselines: one is a recent model that simultaneously learns to restore and translate Hanja historical documents, and the other is a transformer trained only on newly translated corpora. The results show that our method significantly outperforms the baselines in terms of BLEU scores for both modern Korean and English translations. We also conduct a human evaluation, which shows that our translation is preferred over the original expert translation.
    A Case of Exponential Convergence Rates for SVM. (arXiv:2205.10055v1 [stat.ML])
    Classification is often the first problem described in introductory machine learning classes. Generalization guarantees for classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. Those "super" fast rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss, associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
    Semi-self-supervised Automated ICD Coding. (arXiv:2205.10088v1 [cs.CL])
    Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free-text format, as they examine and interview patients. In recent years, several studies have provided evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might answer during a patient consultation. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect, which diminishes when clinical features from the examination of the patient and diagnostics are made available. We recommend our method for augmenting scarce datasets in systems that make decisions based on clinical features that do not include examinations or tests.
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v3 [stat.ML] UPDATED)
    Deep reinforcement learning (DRL) has attracted much attention as an approach to solving sequential decision-making problems without mathematical models of systems or environments. In general, constraints may be imposed on the decision making. In this study, we consider optimal decision-making problems with constraints to complete temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), called a $\tau$-CMDP. We formulate the STL-constrained optimal decision-making problem as a $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we demonstrate the learning performance of the proposed algorithm.
    Towards biologically plausible Dreaming and Planning. (arXiv:2205.10044v1 [cs.LG])
    Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require large amounts of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of interactions with the environment needed to learn a desirable policy. However, these methods require biologically implausible ingredients, such as detailed storage of older experiences and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient for exploiting an inner model. We propose a two-module (agent and model) neural network in which "dreaming" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore "planning", an online alternative to dreaming, which shows comparable performance. Importantly, our model does not require detailed storage of experiences, and learns the world-model online. This is a key ingredient for biological plausibility and implementability (e.g., in neuromorphic hardware). Our network is composed of spiking neurons, further increasing the energetic efficiency and the plausibility of the model. To our knowledge, there are no previous works proposing biologically plausible model-based reinforcement learning in recurrent spiking networks. Our work is a step toward building efficient neuromorphic systems for autonomous robots, capable of learning new skills in real-world environments. Even when the environment is no longer accessible, the robot optimizes learning by "reasoning" in its own "mind". These approaches are of great relevance when acquisition from the environment is slow, expensive (robotics) or unsafe (autonomous driving).
    Adversarial Sample Detection for Speaker Verification by Neural Vocoders. (arXiv:2107.00309v4 [cs.SD] UPDATED)
    Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV. We use the neural vocoder to re-synthesize audio and find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples. This effort is, to the best of our knowledge, among the first to pursue such a technical direction for detecting time-domain adversarial samples for ASV, and hence there is a lack of established baselines for comparison. Consequently, we implement the Griffin-Lim algorithm as the detection baseline. The proposed approach achieves effective detection performance that outperforms the baselines in all the settings. We also show that the neural vocoder adopted in the detection framework is dataset-independent. Our codes will be made open-source so that future works can make fair comparisons.
    User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars. (arXiv:2205.10321v1 [eess.SP])
    Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart homes, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF-sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have been successfully applied to a wide range of applications. In this work, we compare LIS and mmWave radars for localization in real-world and simulated environments. In our experiments, the mmWave radar achieves 0.71 Intersection Over Union (IOU) and 3cm error for bounding boxes, while LIS has 0.56 IOU and 10cm distance error. Although the radar outperforms the LIS in terms of accuracy, the LIS additionally supports communication scenarios on top of sensing.
    Visual Concepts Tokenization. (arXiv:2205.10093v1 [cs.CV])
    Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image as a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. In particular, to obtain these concept tokens, we use only cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and the disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
    The Fellowship of the Dyson Ring: ACT&Friends' Results and Methods for GTOC 11. (arXiv:2205.10124v1 [cs.AI])
    Dyson spheres are hypothetical megastructures encircling stars in order to harvest most of their energy output. During the 11th edition of the GTOC challenge, participants were tasked with a complex trajectory planning problem related to the construction of a precursor Dyson structure, a heliocentric ring made of twelve stations. To this purpose, we developed several new approaches that synthesize techniques from machine learning, combinatorial optimization, planning and scheduling, and evolutionary optimization, effectively integrated into a fully automated pipeline. These include a machine-learned transfer time estimator, improving the established Edelbaum approximation and thus better informing a Lazy Race Tree Search to identify and collect asteroids with high arrival mass for the stations; a series of optimally-phased low-thrust transfers to all stations computed by indirect optimization techniques, exploiting the synodic periodicity of the system; and a modified Hungarian scheduling algorithm, which utilizes evolutionary techniques to arrange a mass-balanced arrival schedule out of all transfer possibilities. We describe the steps of our pipeline in detail with a special focus on how our approaches mutually benefit from each other. Lastly, we outline and analyze the final solution of our team, ACT&Friends, which ranked second at the GTOC 11 challenge.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v3 [cs.LG] UPDATED)
    In lifelong learning systems based on artificial neural networks, one of the biggest obstacles is the inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this paper, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike networks of today, does not learn via the popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts synapses in a biologically plausible fashion while another neural system learns to direct and control this cortex-like structure, mimicking some of the task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting than standard neural models, outperforming a swath of previously proposed methods, including rehearsal/data-buffer-based methods, on both standard benchmarks (SplitMNIST, Split Fashion MNIST, etc.) and custom ones, even though it is trained in a stream-like fashion. Our work offers evidence that emulating mechanisms in real neuronal systems, e.g., local learning and lateral competition, can yield new directions for tackling the grand challenge of lifelong machine learning.
    Nonlinear Initialization Methods for Low-Rank Neural Networks. (arXiv:2202.00834v3 [cs.LG] UPDATED)
    We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. The most successful existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is inspired by the insight that approximating the function corresponding to each layer is more important than approximating the parameter values. We provably demonstrate that there is a significant gap between these two approaches for ReLU networks, particularly as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter point holds when the network width is super-linear in dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem for fixed parameter rank $r$ -- previously, it was unknown that the problem was computationally tractable to solve even for rank $1$. We also provide a practical algorithm to solve this problem which is no more expensive than the existing spectral initialization approach, and validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
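    For reference, the spectral initialization baseline that the paper improves upon can be sketched directly: draw a full-rank initialization and take its best rank-r Frobenius approximation via truncated SVD, splitting the singular values between the two factors. The He-style initialization and dimensions are illustrative assumptions.
    ```python
    import numpy as np

    def spectral_init(d_out, d_in, rank, rng):
        """Spectral initialization: sample a full-rank init, then take its best
        rank-r approximation in Frobenius norm via truncated SVD."""
        W = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))  # He-style init
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        sqrt_s = np.sqrt(s[:rank])
        A = U[:, :rank] * sqrt_s          # (d_out, r)
        B = sqrt_s[:, None] * Vt[:rank]   # (r, d_in)
        return A, B                       # the layer weight is re-parameterized as A @ B

    rng = np.random.default_rng(0)
    A, B = spectral_init(d_out=256, d_in=512, rank=16, rng=rng)
    print(np.linalg.norm(A @ B))
    ```
    The paper's own method instead approximates the layer's function rather than its parameters; this block shows only the prior approach it is measured against.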
    Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization. (arXiv:2205.10232v1 [cs.LG])
    There is a broad consensus on the importance of deep learning models in tasks involving complex data. Often, an adequate understanding of these models is required when focusing on the transparency of decisions in human-critical applications. Besides other explainability techniques, trustworthiness can be achieved by using counterfactuals, much as a human becomes familiar with an unknown process: by understanding the hypothetical circumstances under which the output changes. In this work we argue that automated counterfactual generation should regard several aspects of the produced adversarial instances, not only their adversarial capability. To this end, we present a novel framework for the generation of counterfactual examples which formulates its goal as a multi-objective optimization problem balancing three different objectives: 1) plausibility, i.e., the likeliness of the counterfactual being possible as per the distribution of the input data; 2) the intensity of the changes to the original input; and 3) adversarial power, namely, the variability of the model's output induced by the counterfactual. The framework starts from a target model to be audited and uses a Generative Adversarial Network to model the distribution of input data, together with a multi-objective solver for the discovery of counterfactuals balancing among these objectives. The utility of the framework is showcased on six classification tasks comprising image and three-dimensional data. The experiments verify that the framework unveils counterfactuals that comply with intuition, increasing user trust, and leading to further insights, such as the detection of bias and data misrepresentation.
    Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors. (arXiv:2205.10279v1 [cs.LG])
    Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. Instead, we show that we can learn highly informative posteriors from the source task, through supervised or self-supervised approaches, which then serve as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on a variety of downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. These highly informative priors also can be saved for future use, similar to pre-trained weights, and stand in contrast to the zero-mean isotropic uninformative priors that are typically used in Bayesian deep learning.
    Heterformer: A Transformer Architecture for Node Representation Learning on Heterogeneous Text-Rich Networks. (arXiv:2205.10282v1 [cs.CL])
    We study node representation learning on heterogeneous text-rich networks, where nodes and edges are multi-typed and some types of nodes are associated with text information. Although recent studies on graph neural networks (GNNs) and pretrained language models (PLMs) have demonstrated their power in encoding network and text signals, respectively, less focus has been given to delicately coupling these two types of models on heterogeneous text-rich networks. Specifically, existing GNNs rarely model text in each node in a contextualized way; existing PLMs can hardly be applied to characterize graph structures due to their sequence architecture. In this paper, we propose Heterformer, a Heterogeneous GNN-nested transformer that blends GNNs and PLMs into a unified model. Different from previous "cascaded architectures" that directly add GNN layers upon a PLM, our Heterformer alternately stacks two modules - a graph-attention-based neighbor aggregation module and a transformer-based text and neighbor joint encoding module - to facilitate thorough mutual enhancement between network and text signals. Meanwhile, Heterformer is capable of characterizing network heterogeneity and nodes without text information. Comprehensive experiments on three large-scale datasets from different domains demonstrate the superiority of Heterformer over state-of-the-art baselines in link prediction, transductive/inductive node classification, node clustering, and semantics-based retrieval.
    Set-based Meta-Interpolation for Few-Task Meta-Learning. (arXiv:2205.09990v1 [cs.LG])
    Meta-learning approaches enable machine learning systems to adapt to new tasks given few examples by leveraging knowledge from related tasks. However, a large number of meta-training tasks are still required for generalization to unseen tasks during meta-testing, which introduces a critical bottleneck for real-world problems that come with only a few tasks, due to various reasons including the difficulty and cost of constructing tasks. Recently, several task augmentation methods have been proposed to tackle this issue using domain-specific knowledge to design augmentation techniques to densify the meta-training task distribution. However, such reliance on domain-specific knowledge renders these methods inapplicable to other domains. While Manifold-Mixup-based task augmentation methods are domain-agnostic, we empirically find them ineffective on non-image domains. To tackle these limitations, we propose a novel domain-agnostic task augmentation method, Meta-Interpolation, which utilizes expressive neural set functions to densify the meta-training task distribution using bilevel optimization. We empirically validate the efficacy of Meta-Interpolation on eight datasets spanning various domains such as image classification, molecule property prediction, text classification and speech recognition. Experimentally, we show that Meta-Interpolation consistently outperforms all the relevant baselines. Theoretically, we prove that task interpolation with the set function regularizes the meta-learner to improve generalization.
    A Proximal Algorithm for Sampling from Non-convex Potentials. (arXiv:2205.10188v1 [cs.LG])
    We study sampling problems associated with non-convex potentials that meanwhile lack smoothness. In particular, we consider target distributions that satisfy either a logarithmic-Sobolev inequality or a Poincaré inequality. Rather than smooth, the potentials are assumed to be semi-smooth or the summation of multiple semi-smooth functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling in the non-convex and semi-smooth setting. This work extends the recent algorithm of (LiaChe21; LiaChe22) for non-smooth/semi-smooth log-concave distributions to the setting with non-convex potentials. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.
    Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. (arXiv:2205.09853v1 [cs.CV])
    Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. https://mask-cond-video-diffusion.github.io
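    The masking scheme itself is simple enough to sketch: independently mask all past frames and/or all future frames of each clip, and let the surviving blocks determine which task the conditional model is trained on. The tensor shapes, block sizes, and zero-masking below are illustrative assumptions rather than MCVD's exact implementation.
    ```python
    import numpy as np
    import torch

    def sample_masked_batch(video, p_past, f_future, rng):
        """Randomly and independently mask the past and/or future frame blocks
        of a clip; which blocks survive determines the training task for this
        example (prediction, unconditional generation, or interpolation)."""
        past = video[:, :p_past]
        current = video[:, p_past:-f_future]        # block the model must denoise
        future = video[:, -f_future:]
        mask_past = bool(rng.integers(2))           # independent coin flips
        mask_future = bool(rng.integers(2))
        cond_past = torch.zeros_like(past) if mask_past else past
        cond_future = torch.zeros_like(future) if mask_future else future
        # both masked -> unconditional generation; only future masked ->
        # prediction; neither masked -> interpolation of the middle block.
        return cond_past, current, cond_future

    rng = np.random.default_rng(0)
    video = torch.rand(4, 12, 3, 32, 32)            # (batch, frames, C, H, W)
    cond_past, target, cond_future = sample_masked_batch(video, p_past=4, f_future=4, rng=rng)
    ```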
    A New Feature Selection Method for LogNNet and its Application for Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values. (arXiv:2205.09974v1 [cs.LG])
    Since February 2020, the world has been engaged in an intense struggle with the COVID-19 disease, and health systems have come under tragic pressure as the disease turned into a pandemic. The aim of this study is to determine the most effective routine blood values (RBV) in the diagnosis/prognosis of COVID-19 using a new feature selection method for the LogNNet reservoir neural network. The first dataset in this study consists of a total of 5296 patients with equal numbers of negative and positive COVID tests. The second dataset consists of a total of 3899 patients diagnosed with COVID-19 who were treated in hospital, comprising severely infected (203) and mildly infected (3696) cases. The most important RBVs affecting the diagnosis of the disease in the first dataset were mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH) and activated partial prothrombin time (aPTT). The most effective features in the prognosis of the disease were erythrocyte sedimentation rate (ESR), neutrophil count (NEU), and C-reactive protein (CRP). The LogNNet model achieved an accuracy rate of A46 = 99.5% in the diagnosis of the disease with 46 features and A3 = 99.17% with only the MCHC, MCH, and aPTT features. The model reached an accuracy rate of A48 = 94.4% in determining the prognosis of the disease with 48 features and A3 = 82.7% with only the ESR, NEU, and CRP features. The LogNNet model demonstrated very high COVID-19 diagnosis/prognosis performance without knowledge of the symptoms or history of the patients. The model is suitable for devices with low resources (3-14 kB of RAM used on the Arduino microcontroller) and is promising for creating mobile health monitoring systems in the Internet of Things. Our method will reduce the negative pressure on the health sector, help doctors understand the pathogenesis of COVID-19 through key features, and contribute positively to treatment processes.
    The price of ignorance: how much does it cost to forget noise structure in low-rank matrix estimation?. (arXiv:2205.10009v1 [cs.IT])
    We consider the problem of estimating a rank-1 signal corrupted by structured rotationally invariant noise, and address the following question: how well do inference algorithms perform when the noise statistics are unknown and hence Gaussian noise is assumed? While the matched Bayes-optimal setting with unstructured noise is well understood, the analysis of this mismatched problem is still in its early stages. In this paper, we make a step towards understanding the effect of the strong source of mismatch which is the noise statistics. Our main technical contribution is the rigorous analysis of a Bayes estimator and of an approximate message passing (AMP) algorithm, both of which incorrectly assume a Gaussian setup. The first result exploits the theory of spherical integrals and of low-rank matrix perturbations; the idea behind the second one is to design and analyze an artificial AMP which, by taking advantage of the flexibility in the denoisers, is able to "correct" the mismatch. Armed with these sharp asymptotic characterizations, we unveil a rich and often unexpected phenomenology. For example, although AMP is in principle designed to efficiently compute the Bayes estimator, the former is outperformed by the latter in terms of mean-square error. We show that this performance gap is due to an incorrect estimation of the signal norm. In fact, when the SNR is large enough, the overlaps of the AMP and the Bayes estimator coincide, and they even match those of optimal estimators taking into account the structure of the noise.
    Lifelong Personal Context Recognition. (arXiv:2205.10123v1 [cs.AI])
    We focus on the development of AIs which live in lifelong symbiosis with a human. The key prerequisite for this task is that the AI understands - at any moment in time - the personal situational context that the human is in. We outline the key challenges that this task brings forth, namely (i) handling the human-like and ego-centric nature of the user's context, necessary for understanding and providing useful suggestions, (ii) performing lifelong context recognition using machine learning in a way that is robust to change, and (iii) maintaining alignment between the AI's and the human's representations of the world through continual bidirectional interaction. In this short paper, we summarize our recent attempts at tackling these challenges, discuss the lessons learned, and highlight directions of future research. The main take-away message is that pursuing this project requires research which lies at the intersection of knowledge representation and machine learning. Neither technology can achieve this goal without the other.
    Machine Learning for Combinatorial Optimisation of Partially-Specified Problems: Regret Minimisation as a Unifying Lens. (arXiv:2205.10157v1 [cs.LG])
    It is increasingly common to solve combinatorial optimisation problems that are partially-specified. We survey the case where the objective function or the relations between variables are not known or are only partially specified. The challenge is to learn them from available data, while taking into account a set of hard constraints that a solution must satisfy, and that solving the optimisation problem (esp. during learning) is computationally very demanding. This paper overviews four seemingly unrelated approaches, that can each be viewed as learning the objective function of a hard combinatorial optimisation problem: 1) surrogate-based optimisation, 2) empirical model learning, 3) decision-focused learning (`predict + optimise'), and 4) structured-output prediction. We formalise each learning paradigm, at first in the ways commonly found in the literature, and then bring the formalisations together in a compatible way using regret. We discuss the differences and interactions between these frameworks, highlight the opportunities for cross-fertilization and survey open directions.
    Triangulation candidates for Bayesian optimization. (arXiv:2112.07457v2 [stat.CO] UPDATED)
    Bayesian optimization involves "inner optimization" over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these "tricands" and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.
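    The core construction can be sketched with SciPy, assuming interior candidates taken as simplex centroids of the Delaunay triangulation of the existing design; the paper's full "tricands" recipe also handles boundary candidates, which this sketch omits.
    ```python
    import numpy as np
    from scipy.spatial import Delaunay

    def triangulation_candidates(X):
        """Candidate acquisition locations from a Delaunay triangulation of
        the existing design: one centroid per simplex (interior candidates)."""
        tri = Delaunay(X)
        return X[tri.simplices].mean(axis=1)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(12, 2))                   # existing input design
    candidates = triangulation_candidates(X)
    # Discrete inner optimization: evaluate the acquisition criterion on
    # `candidates` and pick the argmax instead of running a local optimizer.
    ```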
    Greedy structure learning from data that contain systematic missing values. (arXiv:2107.04184v3 [cs.LG] UPDATED)
    Learning from data that contain missing values represents a common phenomenon in many domains. Relatively few Bayesian Network (BN) structure learning algorithms account for missing data, and those that do tend to rely on standard approaches that assume missing data are missing at random, such as the Expectation-Maximisation algorithm. Because missing data are often systematic, there is a need for more pragmatic methods that can effectively deal with data sets containing values that are missing not at random. The absence of approaches that deal with systematic missing data impedes the application of BN structure learning methods to real-world problems where missingness is not random. This paper describes three variants of greedy search structure learning that utilise pairwise deletion and inverse probability weighting to maximally leverage the observed data and to limit the potential bias caused by missing values. The first two variants can be viewed as sub-versions of the third and best-performing variant, but are important in their own right in illustrating the successive improvements in learning accuracy. The empirical investigations show that the proposed approach outperforms the commonly used and state-of-the-art Structural EM algorithm, in terms of both learning accuracy and efficiency, both when data are missing at random and not at random.
    Robot Learning of Mobile Manipulation with Reachability Behavior Priors. (arXiv:2203.04051v2 [cs.RO] UPDATED)
    Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments. Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation. Reinforcement Learning (RL) holds the promise of endowing robots with adaptive behaviors, but most methods require prohibitively large amounts of data for learning a useful control policy. In this work, we study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks. Namely, we consider the problem of optimal base placement and the subsequent decision of whether to activate the arm for reaching a 6D target. For this, we devise a novel Hybrid RL method that handles discrete and continuous actions jointly, resorting to the Gumbel-Softmax reparameterization. Next, we train a reachability prior using data from the operational robot workspace, inspired by classical methods. Subsequently, we derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by modeling them as a sum of residual approximators. Every time a new task needs to be learned, we can transfer our learned residuals and learn the component of the Q-function that is task-specific, hence, maintaining the task structure from prior behaviors. Moreover, we find that regularizing the target policy with a prior policy yields more expressive behaviors. We evaluate our method in simulation in reaching and fetching tasks of increasing difficulty, and we show the superior performance of BHyRL against baseline methods. Finally, we zero-transfer our learned 6D fetching policy with BHyRL to our MM robot TIAGo++. For more details and code release, please refer to our project site: irosalab.com/rlmmbp
    Test-time Batch Normalization. (arXiv:2205.10210v1 [cs.LG])
    Deep neural networks often suffer from data distribution shift between training and testing, and batch statistics are observed to reflect the shift. In this paper, aiming to alleviate distribution shift at test time, we revisit batch normalization (BN) in the training process and reveal two key insights that benefit test-time optimization: $(i)$ preserving the same gradient backpropagation form as in training, and $(ii)$ using dataset-level statistics for robust optimization and inference. Based on these two insights, we propose a novel test-time BN layer design, GpreBN, which is optimized during testing by minimizing an entropy loss. We verify the effectiveness of our method on two typical settings with distribution shift, i.e., domain generalization and robustness tasks. Our GpreBN significantly improves test-time performance and achieves state-of-the-art results.
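    A sketch of the general test-time entropy-minimization setup that GpreBN builds on (in the spirit of methods like TENT): update only the normalization layers' affine parameters by minimizing prediction entropy on unlabeled test batches. The toy network, learning rate, and parameter selection are illustrative assumptions; GpreBN's specific layer design differs.
    ```python
    import torch

    def entropy_minimization_step(model, batch, optimizer):
        """One test-time adaptation step: minimize the entropy of predictions
        on an unlabeled test batch, updating only the parameters handed to
        the optimizer (here, the BN affine parameters)."""
        probs = model(batch).softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
        return entropy.item()

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.BatchNorm2d(8),
        torch.nn.ReLU(), torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(), torch.nn.Linear(8, 10),
    )
    bn_params = [p for m in model.modules() if isinstance(m, torch.nn.BatchNorm2d)
                 for p in m.parameters()]
    optimizer = torch.optim.SGD(bn_params, lr=1e-3)
    entropy_minimization_step(model, torch.rand(16, 3, 32, 32), optimizer)
    ```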
    DEMAND: Deep Matrix Approximately Nonlinear Decomposition to Identify Meta, Canonical, and Sub-Spatial Pattern of functional Magnetic Resonance Imaging in the Human Brain. (arXiv:2205.10264v1 [cs.LG])
    Deep Neural Networks (DNNs) have already become a crucial computational approach to revealing the spatial patterns in the human brain; however, there are three major shortcomings in utilizing DNNs to detect the spatial patterns in functional Magnetic Resonance signals: 1) the fully connected architecture increases the complexity of the network structure, making it difficult to optimize and vulnerable to overfitting; 2) the requirement of large training samples erases individual/minor patterns during feature extraction; 3) the hyperparameters must be tuned manually, which is time-consuming. Therefore, we propose a novel deep nonlinear matrix factorization named Deep Matrix Approximately Nonlinear Decomposition (DEMAND) in this work to take advantage of both shallow linear models, e.g., Sparse Dictionary Learning (SDL), and DNNs. At first, the proposed DEMAND employs a non-fully connected and multilayer-stacked architecture that is easier to optimize than canonical DNNs; furthermore, due to the efficient architecture, training DEMAND can avoid overfitting and enables the recognition of individual/minor features based on a small dataset, such as data from a single individual; finally, a novel rank estimator technique is introduced to tune all hyperparameters of DEMAND automatically. Moreover, the proposed DEMAND is validated against four peer methodologies on real functional Magnetic Resonance Imaging data from the human brain. In short, the validation results demonstrate that DEMAND can reveal the reproducible meta, canonical, and sub-spatial features of the human brain more efficiently than the peer methodologies.
    Planning with Diffusion for Flexible Behavior Synthesis. (arXiv:2205.09991v1 [cs.LG])
    Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.
    Leveraging Relational Information for Learning Weakly Disentangled Representations. (arXiv:2205.10056v1 [cs.LG])
    Disentanglement is a difficult property to enforce in neural representations. This might be due, in part, to a formalization of the disentanglement problem that focuses too heavily on separating relevant factors of variation of the data in single isolated dimensions of the neural representation. We argue that such a definition might be too restrictive and not necessarily beneficial in terms of downstream tasks. In this work, we present an alternative view on learning (weakly) disentangled representations, which leverages concepts from relational learning. We identify the regions of the latent space that correspond to specific instances of generative factors, and we learn the relationships among these regions in order to perform controlled changes to the latent codes. We also introduce a compound generative model that implements such a weak disentanglement approach. Our experiments show that the learned representations can separate the relevant factors of variation in the data, while preserving the information needed for effectively generating high-quality data samples.
    Explainable Supervised Domain Adaptation. (arXiv:2205.09943v1 [cs.LG])
    Domain adaptation techniques have contributed to the success of deep learning. Leveraging knowledge from an auxiliary source domain for learning in a label-scarce target domain is fundamental to domain adaptation. While these techniques increase accuracy, the adaptation process, particularly the knowledge leveraged from the source domain, remains unclear. This paper proposes an explainable-by-design supervised domain adaptation framework - XSDA-Net. We integrate a case-based reasoning mechanism into the XSDA-Net to explain the prediction of a test instance in terms of similar-looking regions in the source and target train images. We empirically demonstrate the utility of the proposed framework by curating the domain adaptation settings on datasets popularly known to exhibit part-based explainability.
    Towards Understanding Grokking: An Effective Theory of Representation Learning. (arXiv:2205.10343v1 [cs.LG])
    We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion. We find representation learning to occur only in a "Goldilocks zone" (including comprehension and grokking) between memorization and confusion. Compared to the comprehension phase, the grokking phase stays closer to the memorization phase, leading to delayed generalization. The Goldilocks phase is reminiscent of "intelligence from starvation" in Darwinian evolution, where resource limitations drive discovery of more efficient solutions. This study not only provides intuitive explanations of the origin of grokking, but also highlights the usefulness of physics-inspired tools, e.g., effective theories and phase diagrams, for understanding deep learning.
    Constructive Interpretability with CoLabel: Corroborative Integration, Complementary Features, and Collaborative Learning. (arXiv:2205.10011v1 [cs.CV])
    Machine learning models with explainable predictions are increasingly sought after, especially for real-world, mission-critical applications that require bias detection and risk mitigation. Inherent interpretability, where a model is designed from the ground-up for interpretability, provides intuitive insights and transparent explanations on model prediction and performance. In this paper, we present CoLabel, an approach to build interpretable models with explanations rooted in the ground truth. We demonstrate CoLabel in a vehicle feature extraction application in the context of vehicle make-model recognition (VMMR). CoLabel performs VMMR with a composite of interpretable features such as vehicle color, type, and make, all based on interpretable annotations of the ground truth labels. First, CoLabel performs corroborative integration to join multiple datasets that each have a subset of desired annotations of color, type, and make. Then, CoLabel uses decomposable branches to extract complementary features corresponding to desired annotations. Finally, CoLabel fuses them together for final predictions. During feature fusion, CoLabel harmonizes complementary branches so that VMMR features are compatible with each other and can be projected to the same semantic space for classification. With inherent interpretability, CoLabel achieves superior performance to the state-of-the-art black-box models, with accuracy of 0.98, 0.95, and 0.94 on CompCars, Cars196, and BoxCars116K, respectively. CoLabel provides intuitive explanations due to constructive interpretability, and subsequently achieves high accuracy and usability in mission-critical situations.
    Characteristic Neural Ordinary Differential Equations. (arXiv:2111.13207v3 [cs.LG] UPDATED)
    We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODEs model the evolution of latent variables as the solution to an ODE, C-NODE models the evolution of the latent variables as the solution of a family of first-order quasi-linear partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This in turn allows the application of the standard frameworks for solving ODEs, namely the adjoint method. Learning optimal characteristic curves for given tasks improves the performance and computational efficiency, compared to state-of-the-art NODE models. We prove that the C-NODE framework extends the classical NODE on classification tasks by demonstrating explicit C-NODE representable functions not expressible by NODEs. Additionally, we present C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate the improvements provided by the proposed method for classification and density estimation on CIFAR-10, SVHN, and MNIST datasets under a similar computational budget as the existing NODE methods. The results also provide empirical evidence that the learned curves improve the efficiency of the system through a lower number of parameters and function evaluations compared with baselines.
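    To make the reduction concrete, here is the standard textbook statement of the method of characteristics (not the paper's exact notation). A first-order quasi-linear PDE in one spatial dimension,

        $$ u_t + a(x, t, u)\, u_x = b(x, t, u), $$

    collapses, along characteristic curves defined by

        $$ \frac{dt}{ds} = 1, \qquad \frac{dx}{ds} = a(x, t, u), \qquad \frac{du}{ds} = b(x, t, u), $$

    to a system of ODEs, so the latent state can be integrated with a standard ODE solver and trained with the adjoint method.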
    Towards Extremely Fast Bilevel Optimization with Self-governed Convergence Guarantees. (arXiv:2205.10054v1 [math.OC])
    Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning and vision fields. The validity of existing works heavily relies on solving a series of approximation subproblems with extraordinarily high accuracy. Unfortunately, achieving such accuracy requires a large number of time-consuming iterations, which naturally incurs a heavy computational burden. This paper is thus devoted to addressing this critical computational issue. In particular, we propose a single-level formulation to uniformly understand existing explicit and implicit Gradient-based BLOs (GBLOs). This, together with our designed counter-example, clearly illustrates the fundamental numerical and theoretical issues of GBLOs and their naive accelerations. By introducing the dual multipliers as a new variable, we then establish Bilevel Alternating Gradient with Dual Correction (BAGDC), a general framework that significantly accelerates different categories of existing methods under specific settings. A striking feature of our convergence result is that, compared to the original unaccelerated GBLO versions, the fast BAGDC admits a unified non-asymptotic convergence theory towards stationarity. A variety of numerical experiments have also been conducted to demonstrate the superiority of the proposed algorithmic framework.
    Minimal Explanations for Neural Network Predictions. (arXiv:2205.09901v1 [cs.LG])
    Explaining neural network predictions is known to be a challenging problem. In this paper, we propose a novel approach which can be effectively exploited, either in isolation or in combination with other methods, to enhance the interpretability of neural model predictions. For a given input to a trained neural model, our aim is to compute a smallest set of input features so that the model prediction changes when these features are disregarded by setting them to an uninformative baseline value. While computing such minimal explanations is computationally intractable in general for fully-connected neural networks, we show that the problem becomes solvable in polynomial time by a greedy algorithm under mild assumptions on the network's activation functions. We then show that our tractability result extends seamlessly to more advanced neural architectures such as convolutional and graph neural networks. We conduct experiments to showcase the capability of our method for identifying the input features that are essential to the model's prediction.
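    The greedy procedure can be sketched as follows. This is a minimal illustration assuming a generic predict_proba interface and a user-chosen baseline; it does not check the activation-function assumptions under which the paper proves polynomial-time tractability.

        # Greedily mask features (set to baseline) until the predicted class
        # changes; the chosen indices form a small explanation set.
        import numpy as np

        def greedy_minimal_explanation(predict_proba, x, baseline):
            orig = int(np.argmax(predict_proba(x)))
            masked = x.astype(float).copy()
            remaining = list(range(len(x)))
            chosen = []
            while int(np.argmax(predict_proba(masked))) == orig and remaining:
                # Pick the feature whose masking most lowers the original
                # class score.
                def drop(i):
                    trial = masked.copy(); trial[i] = baseline[i]
                    return predict_proba(trial)[orig]
                i = min(remaining, key=drop)
                masked[i] = baseline[i]
                chosen.append(i); remaining.remove(i)
            return chosen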
    Evolutionary Multi-Armed Bandits with Genetic Thompson Sampling. (arXiv:2205.10113v1 [cs.NE])
    As two popular schools of machine learning, online learning and evolutionary computation have become two important driving forces behind real-world decision making engines for applications in biomedicine, economics, and engineering fields. Although there is prior work that utilizes bandits to improve evolutionary algorithms' optimization process, how evolutionary approaches can help improve the sequential decision making tasks of online learning agents, such as multi-armed bandits, remains largely unexplored. In this work, we propose Genetic Thompson Sampling, a bandit algorithm that keeps a population of agents and updates them with genetic principles such as elite selection, crossover and mutation. Empirical results in multi-armed bandit simulation environments and a practical epidemic control problem suggest that by incorporating the genetic algorithm into the bandit algorithm, our method significantly outperforms the baselines in nonstationary settings. Lastly, we introduce EvoBandit, a web-based interactive visualization to guide readers through the entire learning process and perform lightweight evaluations on the fly. We hope this investigation engages researchers in this growing field of research.
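    A toy sketch of the idea for Bernoulli bandits: a population of Thompson Sampling agents with elite selection on cumulative reward, uniform crossover of Beta parameters, and multiplicative mutation. All operators, rates, and class names are illustrative assumptions, not the paper's.

        import numpy as np

        rng = np.random.default_rng(0)

        class TSAgent:
            def __init__(self, n_arms):
                self.a, self.b = np.ones(n_arms), np.ones(n_arms)
                self.reward = 0.0
            def act(self):
                return int(np.argmax(rng.beta(self.a, self.b)))
            def update(self, arm, r):
                self.a[arm] += r; self.b[arm] += 1 - r; self.reward += r

        def evolve(pop, n_elite=2, mut=0.1):
            pop.sort(key=lambda ag: ag.reward, reverse=True)
            elites, children = pop[:n_elite], []
            for _ in range(len(pop) - n_elite):
                p, q = rng.choice(np.array(elites, dtype=object), 2)
                child = TSAgent(len(p.a))
                mask = rng.random(len(p.a)) < 0.5       # uniform crossover
                noise = rng.normal(1.0, mut, len(p.a))  # mutation
                child.a = np.maximum(np.where(mask, p.a, q.a) * noise, 1e-3)
                child.b = np.maximum(np.where(mask, p.b, q.b) * noise, 1e-3)
                children.append(child)
            for ag in pop:
                ag.reward = 0.0                         # reset fitness window
            return elites + children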
    Instagram Fake and Automated Account Detection. (arXiv:1910.03090v3 [cs.IR] UPDATED)
    Fake engagement is one of the significant problems in Online Social Networks (OSNs): it is used to increase the popularity of an account in an inorganic manner. The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, wrong product prediction systems, and an unhealthy social network environment. This study addresses the detection of fake and automated accounts, which lead to fake engagement on Instagram. Prior to this work, there was no publicly available dataset for fake and automated accounts, so two datasets have been published for the detection of such accounts. For the detection of these accounts, machine learning algorithms like Naive Bayes, Logistic Regression, Support Vector Machines and Neural Networks are applied. Additionally, for the detection of automated accounts, a cost-sensitive genetic algorithm is proposed to handle the unnatural bias in the dataset. To deal with the unevenness problem in the fake dataset, the SMOTE-NC algorithm is implemented. For the automated and fake account detection datasets, 86% and 96% classification accuracies are obtained, respectively.
    A Fully Controllable Agent in the Path Planning using Goal-Conditioned Reinforcement Learning. (arXiv:2205.09967v1 [cs.AI])
    The aim of path planning is to find a route by which an agent can reach a goal from a starting point. In path planning, the routes may vary depending on many variables, so it is important for the agent to be able to reach various goals. Numerous studies, however, have dealt with a single goal that is predefined by the user. In the present study, I propose a novel reinforcement learning framework for a fully controllable agent in path planning. To do this, I propose bi-directional memory editing to obtain various bi-directional trajectories of the agent, in which the behavior of the agent and the sub-goals are trained with goal-conditioned RL. For moving the agent in various directions, I utilize a dedicated sub-goal network, separated from the policy network. Lastly, I present a reward shaping that shortens the number of steps the agent needs to reach the goal. In the experiments, the agent was able to reach various goals that it had never visited during training. I confirmed that the agent could perform difficult missions, such as a round trip, and that with the reward shaping the agent used shorter routes.
    Generating Semantic Adversarial Examples via Feature Manipulation. (arXiv:2001.02297v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g. thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.
    Classification of Intra-Pulse Modulation of Radar Signals by Feature Fusion Based Convolutional Neural Networks. (arXiv:2205.09834v1 [cs.LG])
    Detection and classification of radars based on the pulses they transmit is an important application in electronic warfare systems. In this work, we propose a novel deep-learning based technique that automatically recognizes intra-pulse modulation types of radar signals. The re-assigned spectrogram of the measured radar signal, together with detected outliers of its instantaneous phases filtered by a special function, is used for training multiple convolutional neural networks. Automatically extracted features from the networks are fused to distinguish frequency and phase modulated signals. Simulation results show that the proposed FF-CNN (Feature Fusion based Convolutional Neural Network) technique outperforms the current state-of-the-art alternatives and scales easily across a broad range of modulation types.
    BayesPCN: A Continually Learnable Predictive Coding Associative Memory. (arXiv:2205.09930v1 [cs.LG])
    Associative memory plays an important role in human intelligence and its mechanisms have been linked to attention in machine learning. While the machine learning community's interest in associative memories has recently been rekindled, most work has focused on memory recall ($read$) over memory learning ($write$). In this paper, we present BayesPCN, a hierarchical associative memory capable of performing continual one-shot memory writes without meta-learning. Moreover, BayesPCN is able to gradually forget past observations ($forget$) to free its memory. Experiments show that BayesPCN can recall corrupted i.i.d. high-dimensional data observed hundreds of "timesteps" ago without a significant drop in recall ability compared to the state-of-the-art offline-learned associative memory models.
    EXODUS: Stable and Efficient Training of Spiking Neural Networks. (arXiv:2205.10242v1 [cs.NE])
    Spiking Neural Networks (SNNs) are gaining significant traction in machine learning tasks where energy-efficiency is of utmost importance. Training such networks using the state-of-the-art back-propagation through time (BPTT) is, however, very time-consuming. Previous work by Shrestha and Orchard [2018] employs an efficient GPU-accelerated back-propagation algorithm called SLAYER, which speeds up training considerably. SLAYER, however, does not take into account the neuron reset mechanism while computing the gradients, which we argue to be the source of numerical instability. To counteract this, SLAYER introduces a gradient scale hyperparameter across layers, which needs manual tuning. In this paper, (i) we modify SLAYER and design an algorithm called EXODUS, that accounts for the neuron reset mechanism and applies the Implicit Function Theorem (IFT) to calculate the correct gradients (equivalent to those computed by BPTT), (ii) we eliminate the need for ad-hoc scaling of gradients, thus, reducing the training complexity tremendously, (iii) we demonstrate, via computer simulations, that EXODUS is numerically stable and achieves a comparable or better performance than SLAYER especially in various tasks with SNNs that rely on temporal features. Our code is available at https://github.com/synsense/sinabs-exodus.
    Sample Complexity of Learning Heuristic Functions for Greedy-Best-First and A* Search. (arXiv:2205.09963v1 [cs.LG])
    Greedy best-first search (GBFS) and A* search (A*) are popular algorithms for path-finding on large graphs. Both use so-called heuristic functions, which estimate how close a vertex is to the goal. While heuristic functions have been handcrafted using domain knowledge, recent studies demonstrate that learning heuristic functions from data is effective in many applications. Motivated by this emerging approach, we study the sample complexity of learning heuristic functions for GBFS and A*. We build on a recent framework called \textit{data-driven algorithm design} and evaluate the \textit{pseudo-dimension} of a class of utility functions that measure the performance of parameterized algorithms. Assuming that a vertex set of size $n$ is fixed, we present $\mathrm{O}(n\lg n)$ and $\mathrm{O}(n^2\lg n)$ upper bounds on the pseudo-dimensions for GBFS and A*, respectively, parameterized by heuristic function values. The upper bound for A* can be improved to $\mathrm{O}(n^2\lg d)$ if every vertex has a degree of at most $d$ and to $\mathrm{O}(n \lg n)$ if edge weights are integers bounded by $\mathrm{poly}(n)$. We also give $\Omega(n)$ lower bounds for GBFS and A*, which imply that our bounds for GBFS and A* under the integer-weight condition are tight up to a $\lg n$ factor. Finally, we discuss a case where the performance of A* is measured by the suboptimality and show that we can sometimes obtain a better guarantee by combining a parameter-dependent worst-case bound with a sample complexity bound.
    Towards Explanation for Unsupervised Graph-Level Representation Learning. (arXiv:2205.09934v1 [cs.LG])
    Due to the superior performance of Graph Neural Networks (GNNs) in various domains, there is an increasing interest in the GNN explanation problem "\emph{which fraction of the input graph is the most crucial to decide the model's decision?}" Existing explanation methods focus on the supervised settings, e.g., node classification and graph classification, while the explanation for unsupervised graph-level representation learning is still unexplored. The opaqueness of the graph representations may lead to unexpected risks when deployed for high-stakes decision-making scenarios. In this paper, we advance the Information Bottleneck principle (IB) to tackle the proposed explanation problem for unsupervised graph representations, which leads to a novel principle, \textit{Unsupervised Subgraph Information Bottleneck} (USIB). We also theoretically analyze the connection between graph representations and explanatory subgraphs on the label space, which reveals that the expressiveness and robustness of representations benefit the fidelity of explanatory subgraphs. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our developed explainer and the validity of our theoretical analysis.
    Towards Consistency in Adversarial Classification. (arXiv:2205.10022v1 [cs.LG])
    In this paper, we study the problem of consistency in the context of adversarial examples. Specifically, we tackle the following question: can surrogate losses still be used as a proxy for minimizing the $0/1$ loss in the presence of an adversary that alters the inputs at test-time? Different from the standard classification task, this question cannot be reduced to a point-wise minimization problem, and calibration need not be sufficient to ensure consistency. In this paper, we expose some pathological behaviors specific to the adversarial problem, and show that no convex surrogate loss can be consistent or calibrated in this context. It is therefore necessary to design another class of surrogate functions that can be used to solve the adversarial consistency issue. As a first step towards designing such a class, we identify sufficient and necessary conditions for a surrogate loss to be calibrated in both the adversarial and standard settings. Finally, we give some directions for building a class of losses that could be consistent in the adversarial framework.
    Bayesian Active Learning with Fully Bayesian Gaussian Processes. (arXiv:2205.10186v1 [cs.LG])
    The bias-variance trade-off is a well-known problem in machine learning that only gets more pronounced the less available data there is. In active learning, where labeled data is scarce or difficult to obtain, neglecting this trade-off can cause inefficient and non-optimal querying, leading to unnecessary data labeling. In this paper, we focus on active learning with Gaussian Processes (GPs). For the GP, the bias-variance trade-off is made by optimization of the two hyperparameters: the length scale and noise-term. Considering that the optimal mode of the joint posterior of the hyperparameters is equivalent to the optimal bias-variance trade-off, we approximate this joint posterior and utilize it to design two new acquisition functions. The first one is a Bayesian variant of Query-by-Committee (B-QBC), and the second is an extension that explicitly minimizes the predictive variance through a Query by Mixture of Gaussian Processes (QB-MGP) formulation. Across six common simulators, we empirically show that B-QBC, on average, achieves the best marginal likelihood, whereas QB-MGP achieves the best predictive performance. We show that incorporating the bias-variance trade-off in the acquisition functions mitigates unnecessary and expensive data labeling.
    SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning. (arXiv:2205.09986v1 [cs.CR])
    Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, the datasets used for training ML models might be under the control of an adversary mounting a data poisoning attack, and MPC prevents inspecting training sets to detect poisoning. We show that multiple MPC frameworks for private ML training are susceptible to backdoor and targeted poisoning attacks. To mitigate this, we propose SafeNet, a framework for building ensemble models in MPC with formal guarantees of robustness to data poisoning attacks. We extend the security definition of private ML training to account for poisoning and prove that our SafeNet design satisfies the definition. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models. For instance, SafeNet reduces backdoor attack success from 100% to 0% for a neural network model, while achieving 39x faster training and 36x less communication than the four-party MPC framework of Dalskov et al.
    Transformer with Memory Replay. (arXiv:2205.09869v1 [cs.LG])
    Transformers achieve state-of-the-art performance for natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving to and replaying from a memory buffer. It has been successfully used in reinforcement learning and GANs due to better sample efficiency. In this paper, we propose \emph{Transformer with Memory Replay} (TMR), which integrates memory replay with the transformer, making the transformer more sample-efficient. Experiments on GLUE and SQuAD benchmark datasets show that Transformer with Memory Replay achieves at least a $1\%$-point increase compared to the baseline transformer model when pretrained with the same number of examples. Further, by adopting a careful design that reduces the wall-clock time overhead of memory replay, we also empirically achieve a better runtime efficiency.
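    A minimal sketch of the memory-replay mechanism in a training loop: past batches are saved to a bounded buffer and periodically replayed. The buffer policy here (reservoir sampling with uniform replay) and all parameter names are illustrative assumptions; TMR's actual design also optimizes the replay overhead.

        import random

        class ReplayBuffer:
            def __init__(self, capacity=1000):
                self.capacity, self.data, self.seen = capacity, [], 0
            def add(self, batch):
                self.seen += 1
                if len(self.data) < self.capacity:
                    self.data.append(batch)
                else:
                    # Reservoir sampling keeps a uniform sample of the stream.
                    j = random.randrange(self.seen)
                    if j < self.capacity:
                        self.data[j] = batch
            def sample(self):
                return random.choice(self.data)

        def train(model, loader, step_fn, buffer, replay_every=4):
            for t, batch in enumerate(loader):
                step_fn(model, batch)                # ordinary gradient step
                buffer.add(batch)
                if t % replay_every == 0 and buffer.data:
                    step_fn(model, buffer.sample())  # extra step on a past batch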
    Diverse super-resolution with pretrained deep hierarchical VAEs. (arXiv:2205.10347v1 [cs.CV])
    Image super-resolution is a one-to-many problem, but most deep-learning based methods only provide one single solution to this problem. In this work, we tackle the problem of diverse super-resolution by reusing VD-VAE, a state-of-the-art variational autoencoder (VAE). We find that the hierarchical latent representation learned by VD-VAE naturally separates the image low-frequency information, encoded in the latent groups at the top of the hierarchy, from the image high-frequency details, determined by the latent groups at the bottom of the latent hierarchy. Starting from this observation, we design a super-resolution model exploiting the specific structure of the VD-VAE latent space. Specifically, we train an encoder to encode low-resolution images in the subset of the VD-VAE latent space encoding the low-frequency information, and we combine this encoder with the VD-VAE generative model to sample diverse super-resolved versions of a low-resolution input. We demonstrate the ability of our method to generate diverse solutions to the super-resolution problem on face super-resolution with upsampling factors x4, x8, and x16.
    On pseudo-absence generation and machine learning for locust breeding ground prediction in Africa. (arXiv:2111.03904v2 [cs.LG] UPDATED)
    Desert locust outbreaks threaten the food security of a large part of Africa and have affected the livelihoods of millions of people over the years. Machine learning (ML) has been demonstrated as an effective approach to locust distribution modelling which could assist in early warning. ML requires a significant amount of labelled data to train. Most publicly available labelled data on locusts are presence-only data, where only the sightings of locusts being present at a location are recorded. Therefore, prior work using ML has resorted to pseudo-absence generation methods as a way to circumvent this issue. The most commonly used approach is to randomly sample points in a region of interest while ensuring that these sampled pseudo-absence points are at least a specific distance away from true presence points. In this paper, we compare this random sampling approach to more advanced pseudo-absence generation methods, such as environmental profiling and optimal background extent limitation, specifically for predicting desert locust breeding grounds in Africa. Interestingly, we find that for the algorithms we tested, namely logistic regression (LR), gradient boosting, random forests and maximum entropy, all popular in prior work, the logistic model performed significantly better than the more sophisticated ensemble methods, both in terms of prediction accuracy and F1 score. Although background extent limitation combined with random sampling boosted performance for ensemble methods, for LR this was not the case, and instead, a significant improvement was obtained when using environmental profiling. In light of this, we conclude that a simpler ML approach such as logistic regression combined with more advanced pseudo-absence generation, specifically environmental profiling, can be a sensible and effective approach to predicting locust breeding grounds across Africa.
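    The random-sampling baseline described above is straightforward to sketch: draw points uniformly in a bounding box and keep those at least a minimum distance from every presence point. Coordinates are treated as planar for simplicity; a real pipeline would use geodesic distances and an environmental mask.

        import numpy as np

        def sample_pseudo_absences(presence, n, bbox, min_dist, seed=0):
            """presence: (P, 2) array; bbox: ((xmin, ymin), (xmax, ymax))."""
            rng = np.random.default_rng(seed)
            (xmin, ymin), (xmax, ymax) = bbox
            out = []
            while len(out) < n:
                # Over-sample, then keep points far enough from all presences.
                pts = rng.uniform([xmin, ymin], [xmax, ymax], size=(4 * n, 2))
                d = np.linalg.norm(pts[:, None, :] - presence[None, :, :], axis=-1)
                out.extend(pts[d.min(axis=1) >= min_dist].tolist())
            return np.array(out[:n])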
    Recurrent segmentation meets block models in temporal networks. (arXiv:2205.09862v1 [cs.SI])
    A popular approach to model interactions is to represent them as a network with nodes being the agents and the interactions being the edges. Interactions are often timestamped, which leads to having timestamped edges. Many real-world temporal networks have a recurrent or possibly cyclic behaviour. For example, social network activity may be heightened during certain hours of day. In this paper, our main interest is to model recurrent activity in such temporal networks. As a starting point we use the stochastic block model, a popular choice for modelling static networks, where nodes are split into $R$ groups. We extend this model to temporal networks by modelling the edges with a Poisson process. We make the parameters of the process dependent on time by segmenting the time line into $K$ segments. To enforce the recurring activity we require that only $H < K$ different sets of parameters can be used, that is, several, not necessarily consecutive, segments must share their parameters. We prove that searching for the optimal blocks and segmentation is an NP-hard problem. Consequently, we split the problem into 3 subproblems where we optimize blocks, model parameters, and segmentation in turn while keeping the remaining structures fixed. We propose an iterative algorithm that requires $O(KHm + Rn + R^2H)$ time per iteration, where $n$ and $m$ are the number of nodes and edges in the network. We demonstrate experimentally that the number of required iterations is typically low, that the algorithm is able to discover the ground truth from synthetic datasets, and that certain real-world networks exhibit recurrent behaviour, as the likelihood does not deteriorate when $H$ is lowered.
    Cross Reconstruction Transformer for Self-Supervised Time Series Representation Learning. (arXiv:2205.09928v1 [cs.LG])
    Unsupervised/self-supervised representation learning in time series is critical since labeled samples are usually scarce in real-world scenarios. Existing approaches mainly leverage the contrastive learning framework, which automatically learns to understand similar and dissimilar data pairs. Nevertheless, they are restricted by the prior knowledge needed for constructing pairs, cumbersome sampling policies, and unstable performance when encountering sampling bias. Also, few works have focused on effectively modeling temporal-spectral relations to extend the capacity of representations. In this paper, we aim at learning representations for time series from a new perspective and propose Cross Reconstruction Transformer (CRT) to solve the aforementioned problems in a unified way. CRT achieves time series representation learning through a cross-domain dropping-reconstruction task. Specifically, we transform time series into the frequency domain and randomly drop certain parts in both time and frequency domains. Dropping can maximally preserve the global context compared to cropping and masking. Then a transformer architecture is utilized to adequately capture the cross-domain correlations between temporal and spectral information through reconstructing data in both domains, which is called Dropped Temporal-Spectral Modeling. To discriminate the representations in global latent space, we propose an Instance Discrimination Constraint to reduce the mutual information between different time series and sharpen the decision boundaries. Additionally, we propose a specified curriculum learning strategy to optimize the CRT, which progressively increases the dropping ratio in the training process.
    ClusterEA: Scalable Entity Alignment with Stochastic Training and Normalized Mini-batch Similarities. (arXiv:2205.10312v1 [cs.DB])
    Entity alignment (EA) aims at finding equivalent entities in different knowledge graphs (KGs). Embedding-based approaches have dominated the EA task in recent years. Those methods face problems that come from the geometric properties of embedding vectors, including hubness and isolation. To solve these geometric problems, many normalization approaches have been adopted for EA. However, the increasing scale of KGs renders it hard for EA models to adopt the normalization processes, thus limiting their usage in real-world applications. To tackle this challenge, we present ClusterEA, a general framework that is capable of scaling up EA models and enhancing their results by leveraging normalization methods on mini-batches with a high entity equivalent rate. ClusterEA contains three components to align entities between large-scale KGs, including stochastic training, ClusterSampler, and SparseFusion. It first trains a large-scale Siamese GNN for EA in a stochastic fashion to produce entity embeddings. Based on the embeddings, a novel ClusterSampler strategy is proposed for sampling highly overlapped mini-batches. Finally, ClusterEA incorporates SparseFusion, which normalizes local and global similarity and then fuses all similarity matrices to obtain the final similarity matrix. Extensive experiments with real-life datasets on EA benchmarks offer insight into the proposed framework, and suggest that it is capable of outperforming the state-of-the-art scalable EA framework by up to 8 times in terms of Hits@1.
    On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. (arXiv:2205.10287v1 [cs.LG])
    Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
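    Stated as a formula, the square root scaling rule for RMSprop and Adam reads

        $$ B \;\to\; \kappa B \quad \Longrightarrow \quad \eta \;\to\; \sqrt{\kappa}\,\eta, $$

    where $B$ is the batch size and $\eta$ the learning rate. For example, growing the batch size from 256 to 1024 ($\kappa = 4$) doubles the learning rate, in contrast to the linear scaling rule commonly used for SGD, which would quadruple it.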
    HeadText: Exploring Hands-free Text Entry using Head Gestures by Motion Sensing on a Smart Earpiece. (arXiv:2205.09978v1 [cs.HC])
    We present HeadText, a hands-free technique for text entry on a smart earpiece via motion sensing. Users input text using only 7 head gestures for key selection, word selection, word commitment and word cancelling tasks. Head gesture recognition is supported by motion sensing on a smart earpiece to capture head movement signals, combined with machine learning algorithms (K-Nearest-Neighbor (KNN) with a Dynamic Time Warping (DTW) distance measurement). A 10-participant user study showed that HeadText could recognize the 7 head gestures with an accuracy of 94.29%. A second user study showed that HeadText could achieve a maximum text entry speed of 10.65 words per minute (WPM) and an average speed of 9.84 WPM. Finally, we demonstrate potential applications of HeadText in hands-free scenarios: (a) text entry for people with motor impairments, (b) private text entry, and (c) socially acceptable text entry.
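    The KNN-with-DTW classifier named in the abstract can be sketched as follows: a standard $O(|a| \cdot |b|)$ dynamic-programming DTW distance plus k-nearest-neighbor voting over stored gesture templates. Feature extraction from the IMU signals and the exact k are assumptions here.

        import numpy as np

        def dtw(a, b):
            """DTW distance between two sequences of (possibly vector) samples."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def knn_predict(query, templates, labels, k=3):
            dists = [dtw(query, t) for t in templates]
            nearest = np.argsort(dists)[:k]
            votes = [labels[i] for i in nearest]
            return max(set(votes), key=votes.count)   # majority vote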
    Interpolating Compressed Parameter Subspaces. (arXiv:2205.09891v1 [cs.LG])
    Inspired by recent work on neural subspaces and mode connectivity, we revisit parameter subspace sampling for shifted and/or interpolatable input distributions (instead of a single, unshifted distribution). We enforce a compressed geometric structure upon a set of trained parameters mapped to a set of train-time distributions, denoting the resulting subspaces as Compressed Parameter Subspaces (CPS). We show the success and failure modes of the types of shifted distributions whose optimal parameters reside in the CPS. We find that ensembling point-estimates within a CPS can yield a high average accuracy across a range of test-time distributions, including backdoor, adversarial, permutation, stylization and rotation perturbations. We also find that the CPS can contain low-loss point-estimates for various task shifts (albeit interpolated, perturbed, unseen or non-identical coarse labels). We further demonstrate this property in a continual learning setting with CIFAR100.
    ExMo: Explainable AI Model using Inverse Frequency Decision Rules. (arXiv:2205.10045v1 [cs.AI])
    In this paper, we present a novel method to compute decision rules to build a more accurate interpretable machine learning model, denoted as ExMo. The ExMo interpretable machine learning model consists of a list of IF...THEN... statements with a decision rule in the condition. This way, ExMo naturally provides an explanation for a prediction using the decision rule that was triggered. ExMo uses a new approach to extract decision rules from the training data using term frequency-inverse document frequency (TF-IDF) features. With TF-IDF, decision rules with feature values that are more relevant to each class are extracted. Hence, the decision rules obtained by ExMo can distinguish the positive and negative classes better than the decision rules used in the existing Bayesian Rule List (BRL) algorithm, obtained using the frequent pattern mining approach. The paper also shows that ExMo learns a qualitatively better model than BRL. Furthermore, ExMo demonstrates that the textual explanation can be provided in a human-friendly way so that the explanation can be easily understood by non-expert users. We validate ExMo on several datasets with different sizes to evaluate its efficacy. Experimental validation on a real-world fraud detection application shows that ExMo is 20% more accurate than BRL and that it achieves accuracy similar to those of deep learning models.
    Visualizing and Explaining Language Models. (arXiv:2205.10238v1 [cs.CL])
    During the last decade, Natural Language Processing has become, after Computer Vision, the second field of Artificial Intelligence that was massively changed by the advent of Deep Learning. Regardless of the architecture, the language models of the day need to be able to process or generate text, as well as predict missing words, sentences or relations depending on the task. Due to their black-box nature, such models are difficult to interpret and explain to third parties. Visualization is often the bridge that language model designers use to explain their work, as the coloring of the salient words and phrases, clustering or neuron activations can be used to quickly understand the underlying models. This paper showcases the techniques used in some of the most popular Deep Learning for NLP visualizations, with a special focus on interpretability and explainability.
    Understanding and Mitigating the Uncertainty in Zero-Shot Translation. (arXiv:2205.10068v1 [cs.CL])
    Zero-shot translation is a promising direction for building a comprehensive multilingual neural machine translation (MNMT) system. However, its quality is still not satisfactory due to off-target issues. In this paper, we aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation. By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues, namely, extrinsic data uncertainty and intrinsic model uncertainty. Based on the observations, we propose two light-weight and complementary approaches: denoising the training data for model training, and masking out the vocabulary of the off-target languages at inference. Extensive experiments on both balanced and unbalanced datasets show that our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines. Qualitative analyses provide insights into where our approaches reduce off-target translations.
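    The inference-time vocabulary masking component can be sketched as follows: logits of tokens outside the target language's allowed vocabulary are set to $-\infty$ before decoding. The per-language token inventory (allowed_ids) is an assumed input; the data-denoising component is not shown.

        import torch

        def mask_off_target(logits, allowed_ids):
            """Zero out (in probability) all tokens not in the target language."""
            mask = torch.full_like(logits, float("-inf"))
            mask[..., allowed_ids] = 0.0   # allowed tokens keep their logits
            return logits + mask

        # usage: next_token = mask_off_target(logits, target_lang_ids).argmax(dim=-1)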
    Estimating the frame potential of large-scale quantum circuit sampling using tensor networks up to 50 qubits. (arXiv:2205.09900v1 [quant-ph])
    We develop numerical protocols for estimating the frame potential, the 2-norm distance between a given ensemble and the exact Haar randomness, using the \texttt{QTensor} platform. Our tensor-network-based algorithm has polynomial complexity for shallow circuits and is high performing using CPU and GPU parallelism. We apply the above methods to two problems: the Brown-Susskind conjecture, with local and parallel random circuits in terms of the Haar distance and the approximate $k$-design properties of the hardware efficient ans{\"a}tze in quantum machine learning, which induce the barren plateau problem. We estimate frame potentials with these ensembles up to 50 qubits and $k=5$, examine the Haar distance of the hardware-efficient ans{\"a}tze, and verify the Brown-Susskind conjecture numerically. Our work shows that large-scale tensor network simulations could provide important hints toward open problems in quantum information science.
    FairNorm: Fair and Fast Graph Neural Network Training. (arXiv:2205.09977v1 [cs.LG])
    Graph neural networks (GNNs) have been demonstrated to achieve state-of-the-art for a number of graph-based learning tasks, which leads to a rise in their employment in various domains. However, it has been shown that GNNs may inherit and even amplify bias within training data, which leads to unfair results towards certain sensitive groups. Meanwhile, training of GNNs introduces additional challenges, such as slow convergence and possible instability. Faced with these limitations, this work proposes FairNorm, a unified normalization framework that reduces the bias in GNN-based learning while also providing provably faster convergence. Specifically, FairNorm employs fairness-aware normalization operators over different sensitive groups with learnable parameters to reduce the bias in GNNs. The design of FairNorm is built upon analyses that illuminate the sources of bias in graph-based learning. Experiments on node classification over real-world networks demonstrate the efficiency of the proposed scheme in improving fairness in terms of statistical parity and equal opportunity compared to fairness-aware baselines. In addition, it is empirically shown that the proposed framework leads to faster convergence compared to the naive baseline where no normalization is employed.
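    For intuition, a group-wise normalization operator can be sketched as below: node features are standardized separately within each sensitive group, with learnable scale and shift per group. This is a simplification of whatever the paper's fairness-aware operators actually compute, and assumes every node carries a group id in [0, n_groups).

        import torch
        import torch.nn as nn

        class GroupWiseNorm(nn.Module):
            def __init__(self, dim, n_groups, eps=1e-5):
                super().__init__()
                self.gamma = nn.Parameter(torch.ones(n_groups, dim))
                self.beta = nn.Parameter(torch.zeros(n_groups, dim))
                self.eps = eps

            def forward(self, h, group):   # h: [N, dim], group: [N] int ids
                out = torch.empty_like(h)
                for g in range(self.gamma.shape[0]):
                    idx = group == g
                    if idx.any():
                        # Standardize within the group, then rescale/shift.
                        z = (h[idx] - h[idx].mean(0)) / (h[idx].std(0, unbiased=False) + self.eps)
                        out[idx] = z * self.gamma[g] + self.beta[g]
                return out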
    Swapping Semantic Contents for Mixing Images. (arXiv:2205.10158v1 [cs.CV])
    Deep architectures have proven capable of solving many tasks provided a sufficient amount of labeled data. In fact, the amount of available labeled data has become the principal bottleneck in low-label settings such as Semi-Supervised Learning. Mixing Data Augmentations do not typically yield new labeled samples, as indiscriminately mixing contents creates between-class samples. In this work, we introduce the SciMix framework, which learns a generator that embeds a semantic style code into image backgrounds, yielding a new mixing scheme for data augmentation. We then demonstrate that SciMix yields novel mixed samples that inherit many characteristics from their non-semantic parents. Afterwards, we verify that those samples can be used to improve the performance of semi-supervised frameworks like Mean Teacher or FixMatch, and even fully supervised learning on a small labeled dataset.
    On Calibration of Ensemble-Based Credal Predictors. (arXiv:2205.10082v1 [stat.ML])
    In recent years, several classification methods that intend to quantify epistemic uncertainty have been proposed, either by producing predictions in the form of second-order distributions or sets of probability distributions. In this work, we focus on the latter, also called credal predictors, and address the question of how to evaluate them: What does it mean that a credal predictor represents epistemic uncertainty in a faithful manner? To answer this question, we refer to the notion of calibration of probabilistic predictors and extend it to credal predictors. Broadly speaking, we call a credal predictor calibrated if it returns sets that cover the true conditional probability distribution. To verify this property for the important case of ensemble-based credal predictors, we propose a novel nonparametric calibration test that generalizes an existing test for probabilistic predictors to the case of credal predictors. Making use of this test, we empirically show that credal predictors based on deep neural networks are often not well calibrated.
    Trend analysis and forecasting air pollution in Rwanda. (arXiv:2205.10024v1 [stat.ML])
    Air pollution is a major public health problem worldwide, although the lack of data is a global issue for most low and middle income countries. Ambient air pollution in the form of fine particulate matter (PM2.5) exceeds the World Health Organization guidelines in Rwanda, with a daily average of around 42.6 micrograms per cubic meter. Monitoring and mitigation strategies require an expensive investment in equipment to collect pollution data. Low-cost sensor technology and machine learning methods have appeared as an alternative solution to get reliable information for decision making. This paper analyzes the trend of air pollution in Rwanda and proposes forecasting models suited to data collected by a network of low-cost sensors deployed in Rwanda.
    Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome. (arXiv:2205.09906v1 [stat.ML])
    Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.
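    A minimal sketch of a mixup-style augmentation in the Aitchison geometry of the simplex: two compositions are combined by powering and perturbation, then re-closed to sum to one. The pseudocount handling of zeros and the Beta-distributed mixing weight are assumptions for illustration, not the paper's exact augmentation set.

        import numpy as np

        def closure(x):
            """Re-normalize so each composition sums to one."""
            return x / x.sum(axis=-1, keepdims=True)

        def aitchison_mixup(x, y, lam, pseudo=1e-6):
            x, y = x + pseudo, y + pseudo   # avoid powers of exact zeros
            # Powering = scalar multiplication, elementwise product =
            # perturbation (addition) in the Aitchison geometry.
            return closure(x ** lam * y ** (1.0 - lam))

        # usage: z = aitchison_mixup(sample_a, sample_b, lam=np.random.beta(2, 2))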
    Conformal Prediction with Temporal Quantile Adjustments. (arXiv:2205.09940v1 [stat.ML])
    We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.
    A General Framework for quantifying Aleatoric and Epistemic uncertainty in Graph Neural Networks. (arXiv:2205.09968v1 [cs.LG])
    Graph Neural Networks (GNNs) provide a powerful framework that elegantly integrates graph theory with machine learning for modeling and analysis of networked data. We consider the problem of quantifying the uncertainty in predictions of GNNs stemming from modeling errors and measurement uncertainty. We consider aleatoric uncertainty in the form of probabilistic links and noise in the feature vectors of nodes, while epistemic uncertainty is incorporated via a probability distribution over the model parameters. We propose a unified approach to treat both sources of uncertainty in a Bayesian framework, where Assumed Density Filtering is used to quantify aleatoric uncertainty and Monte Carlo dropout captures uncertainty in model parameters. Finally, the two sources of uncertainty are aggregated to estimate the total uncertainty in predictions of a GNN. Results on real-world datasets demonstrate that the Bayesian model performs on par with a frequentist model and provides additional information about prediction uncertainty that is sensitive to uncertainties in the data and model.
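    The Monte Carlo dropout half of the recipe is easy to sketch: keep dropout active at test time, draw several stochastic forward passes, and read the spread as epistemic uncertainty. The Assumed Density Filtering step for aleatoric noise is not shown, and the number of samples is arbitrary.

        import torch

        def mc_dropout_predict(model, x, n_samples=20):
            # train() keeps dropout stochastic; in practice one would set only
            # the dropout modules to train mode, leaving e.g. BN in eval mode.
            model.train()
            with torch.no_grad():
                preds = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
            return preds.mean(0), preds.var(0)   # predictive mean and variance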
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v1 [cs.LG])
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.
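    For reference, the extraction rules under scrutiny come from the original Deep Evidential Regression formulation, where the network outputs Normal-Inverse-Gamma parameters $(\gamma, \nu, \alpha, \beta)$ and one reads off

        $$ \mathbb{E}[\mu] = \gamma, \qquad \underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}}, \qquad \underbrace{\operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}}. $$

    The paper argues that treating these quantities as exact aleatoric and epistemic uncertainties is a heuristic and proposes redefinitions.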
    Summarization as Indirect Supervision for Relation Extraction. (arXiv:2205.09837v1 [cs.CL])
    Relation extraction (RE) models have been challenged by their reliance on training data with expensive annotations. Considering that summarization tasks aim at acquiring concise expressions of synoptical information from the longer context, these tasks naturally align with the objective of RE, i.e., extracting a kind of synoptical information that describes the relation of entity mentions. We present SuRE, which converts RE into a summarization formulation. SuRE leads to more precise and resource-efficient RE based on indirect supervision from summarization tasks. To achieve this goal, we develop sentence and relation conversion techniques that essentially bridge the formulation of summarization and RE tasks. We also incorporate constraint decoding techniques with Trie scoring to further enhance summarization-based RE with robust inference. Experiments on three RE datasets demonstrate the effectiveness of SuRE in both full-dataset and low-resource settings, showing that summarization is a promising source of indirect supervision to improve RE models.
    Self-Supervised Depth Estimation with Isometric-Self-Sample-Based Learning. (arXiv:2205.10006v1 [cs.CV])
    Managing the dynamic regions in the photometric loss formulation has been a main issue for handling the self-supervised depth estimation problem. Most previous methods have alleviated this issue by removing the dynamic regions in the photometric loss formulation based on masks estimated by another module, making it difficult to fully utilize the training images. In this paper, to handle this problem, we propose an isometric self-sample-based learning (ISSL) method to fully utilize the training images in a simple yet effective way. The proposed method provides additional supervision during training using self-generated images that comply with the pure static scene assumption. Specifically, the isometric self-sample generator synthesizes self-samples for each training image by applying random rigid transformations on the estimated depth. Thus both the generated self-samples and the corresponding training image always follow the static scene assumption. We show that plugging our ISSL module into several existing models consistently improves the performance by a large margin. In addition, it also boosts depth accuracy over different types of scenes, i.e., outdoor scenes (KITTI and Make3D) and indoor scenes (NYUv2), validating its high effectiveness.
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v1 [stat.ML])
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state of the art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.  ( 2 min )
    Survey on Fair Reinforcement Learning: Theory and Practice. (arXiv:2205.10032v1 [cs.LG])
    Fairness-aware learning aims at satisfying various fairness constraints in addition to the usual performance criteria via data-driven machine learning techniques. Most of the research in fairness-aware learning employs the setting of fair-supervised learning. However, many dynamic real-world applications can be better modeled using sequential decision-making problems and fair reinforcement learning provides a more suitable alternative for addressing these problems. In this article, we provide an extensive overview of fairness approaches that have been implemented via a reinforcement learning (RL) framework. We discuss various practical applications in which RL methods have been applied to achieve a fair solution with high accuracy. We further include various facets of the theory of fair reinforcement learning, organizing them into single-agent RL, multi-agent RL, long-term fairness via RL, and offline learning. Moreover, we highlight a few major issues to explore in order to advance the field of fair-RL, namely - i) correcting societal biases, ii) feasibility of group fairness or individual fairness, and iii) explainability in RL. Our work is beneficial for both researchers and practitioners as we discuss articles providing mathematical guarantees as well as articles with empirical studies on real-world problems.  ( 2 min )
    FedNoiL: A Simple Two-Level Sampling Method for Federated Learning with Noisy Labels. (arXiv:2205.10110v1 [cs.LG])
    Federated learning (FL) aims at training a global model on the server side while the training data are collected and located at the local devices. Hence, the labels in practice are usually annotated by clients of varying expertise or criteria and thus contain different amounts of noise. Local training on noisy labels can easily result in overfitting to noisy labels, which is devastating to the global model through aggregation. Although recent robust FL methods take malicious clients into account, they have not addressed local noisy labels on each device and their impact on the global model. In this paper, we develop a simple two-level sampling method "FedNoiL" that (1) selects clients for more robust global aggregation on the server; and (2) selects clean labels and correct pseudo-labels at the client end for more robust local training. The sampling probabilities are built upon clean label detection by the global model. Moreover, we investigate different schedules changing the local epochs between aggregations over the course of FL, which notably improves the communication and computation efficiency in the noisy-label setting. In experiments with homogeneous/heterogeneous data distributions and noise ratios, we observed that direct combinations of SOTA FL methods with SOTA noisy-label learning methods can easily fail, while our method consistently achieves better and more robust performance.  ( 2 min )
    Exploring Extreme Parameter Compression for Pre-trained Language Models. (arXiv:2205.10036v1 [cs.CL])
    Recent work has explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs), in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated approach. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency during compression. Our compressed BERT with ${1}/{7}$ parameters in Transformer layers performs on par with, and sometimes slightly better than, the original BERT on the GLUE benchmark. A tiny version achieves $96.7\%$ performance of BERT-base with $ {1}/{48} $ encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and $2.7 \times$ faster inference. To show that the proposed method is orthogonal to existing compression methods like knowledge distillation, we also explore the benefit of the proposed method on a distilled BERT.  ( 2 min )
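    The parameter-compression idea can be illustrated with plain low-rank factorization of a single weight matrix (the paper itself uses tensor decomposition across layers, which reaches far higher ratios); a minimal NumPy sketch with illustrative BERT-like dimensions:

        import numpy as np

        def factorize_linear(W, rank):
            # Approximate an (out, in) weight matrix by two low-rank factors,
            # cutting parameters from out*in down to rank*(out + in).
            U, s, Vt = np.linalg.svd(W, full_matrices=False)
            return U[:, :rank] * s[:rank], Vt[:rank, :]

        rng = np.random.default_rng(0)
        W = rng.normal(size=(768, 3072))      # an FFN weight of BERT-base shape
        A, B = factorize_linear(W, rank=64)
        err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
        # (trained weights compress far better than this random example)
        print(f"relative error {err:.3f}, compression {W.size / (A.size + B.size):.1f}x")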
    MiDAS: Multi-integrated Domain Adaptive Supervision for Fake News Detection. (arXiv:2205.09817v1 [cs.LG])
    COVID-19 related misinformation and fake news, coined an 'infodemic', have dramatically increased over the past few years. This misinformation exhibits concept drift, where the distribution of fake news changes over time, reducing the effectiveness of previously trained models for fake news detection. Given a set of fake news models trained on multiple domains, we propose an adaptive decision module to select the best-fit model for a new sample. We propose MiDAS, a multi-domain adaptive approach for fake news detection that ranks the relevancy of existing models to new samples. MiDAS contains two components: a domain-invariant encoder and an adaptive model selector. MiDAS integrates multiple pre-trained and fine-tuned models with their training data to create a domain-invariant representation. Then, MiDAS uses local Lipschitz smoothness of the invariant embedding space to estimate each model's relevance to a new sample. Higher-ranked models provide predictions, and lower-ranked models abstain. We evaluate MiDAS on generalization to drifted data with 9 fake news datasets, each obtained from different domains and modalities. MiDAS achieves new state-of-the-art performance on multi-domain adaptation for out-of-distribution fake news classification.  ( 2 min )
    Neural Additive Models for Nowcasting. (arXiv:2205.10020v1 [cs.LG])
    Deep neural networks (DNNs) are one of the most highlighted methods in machine learning. However, as DNNs are black-box models, they lack explanatory power for their predictions. Recently, neural additive models (NAMs) have been proposed to provide this power while maintaining high prediction performance. In this paper, we propose a novel NAM approach for multivariate nowcasting (NC) problems, which comprise an important focus area of machine learning. For the multivariate time-series data used in NC problems, explanations should be considered for every input value to the variables at distinguishable time steps. By employing generalized additive models, the proposed NAM-NC successfully explains each input value's importance for multiple variables and time steps. Experimental results involving a toy example and two real-world datasets show that NAM-NC predicts multivariate time-series data as accurately as state-of-the-art neural networks, while also providing the explanatory importance of each input value. We also examine parameter-sharing networks using NAM-NC to decrease their complexity; NAM-NC's hard-tied feature net extracts explanations while maintaining good performance.  ( 2 min )
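    A minimal neural additive model in PyTorch shows the additive structure that yields per-feature explanations; this is the generic NAM building block rather than the authors' NAM-NC time-series variant, and the layer sizes are illustrative:

        import torch
        import torch.nn as nn

        class NAM(nn.Module):
            # One small "feature net" per input; the prediction is the sum of
            # per-feature contributions, so each contribution doubles as an
            # additive explanation for its feature.
            def __init__(self, n_features, hidden=32):
                super().__init__()
                self.feature_nets = nn.ModuleList([
                    nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
                    for _ in range(n_features)
                ])
                self.bias = nn.Parameter(torch.zeros(1))

            def forward(self, x):                       # x: (batch, n_features)
                contribs = torch.cat(
                    [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)],
                    dim=1)                              # (batch, n_features)
                return contribs.sum(dim=1) + self.bias, contribs

        model = NAM(n_features=4)
        y_hat, contribs = model(torch.randn(8, 4))      # contribs[i, j]: feature j's share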
    Service Delay Minimization for Federated Learning over Mobile Devices. (arXiv:2205.09868v1 [cs.LG])
    Federated learning (FL) over mobile devices has fostered numerous intriguing applications/services, many of which are delay-sensitive. In this paper, we propose a service delay efficient FL (SDEFL) scheme over mobile devices. Unlike traditional communication-efficient FL, which regards wireless communications as the bottleneck, we find that under many situations, the local computing delay is comparable to the communication delay during the FL training process, given the development of high-speed wireless transmission techniques. Thus, the service delay in FL should be the computing delay plus the communication delay accumulated over training rounds. To minimize the service delay of FL, simply reducing local computing or communication delay independently is not enough; the delay trade-off between local computing and wireless communications must be considered. Besides, we empirically study the impacts of local computing control and compression strategies (i.e., the number of local updates, weight quantization, and gradient quantization) on computing, communication, and service delays. Based on these trade-off observations and empirical studies, we develop an optimization scheme to minimize the service delay of FL over heterogeneous devices. We establish testbeds and conduct extensive emulations/experiments to verify our theoretical analysis. The results show that SDEFL notably reduces service delay with a small accuracy drop compared to peer designs.  ( 2 min )
    Residual Dynamic Mode Decomposition: Robust and verified Koopmanism. (arXiv:2205.09779v1 [physics.flu-dyn])
    Dynamic Mode Decomposition (DMD) describes complex dynamic processes through a hierarchy of simpler coherent features. DMD is regularly used to understand the fundamental characteristics of turbulence and is closely related to Koopman operators. However, verifying the decomposition, equivalently the computed spectral features of Koopman operators, remains a major challenge due to the infinite-dimensional nature of Koopman operators. Challenges include spurious (unphysical) modes, and dealing with continuous spectra, both of which occur regularly in turbulent flows. Residual Dynamic Mode Decomposition (ResDMD), introduced by (Colbrook & Townsend 2021), overcomes some of these challenges through the data-driven computation of residuals associated with the full infinite-dimensional Koopman operator. ResDMD computes spectra and pseudospectra of general Koopman operators with error control, and computes smoothed approximations of spectral measures (including continuous spectra) with explicit high-order convergence theorems. ResDMD thus provides robust and verified Koopmanism. We implement ResDMD and demonstrate its application in a variety of fluid dynamic situations, at varying Reynolds numbers, arising from both numerical and experimental data. Examples include: vortex shedding behind a cylinder; hot-wire data acquired in a turbulent boundary layer; particle image velocimetry data focusing on a wall-jet flow; and acoustic pressure signals of laser-induced plasma. We present some advantages of ResDMD, namely, the ability to verifiably resolve non-linear, transient modes, and spectral calculation with reduced broadening effects. We also discuss how a new modal ordering based on residuals enables greater accuracy with a smaller dictionary than the traditional modulus ordering. This paves the way for greater dynamic compression of large datasets without sacrificing accuracy.  ( 2 min )
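    The residual computation at the heart of ResDMD can be sketched compactly: fit a DMD/Koopman matrix by least squares, then attach a relative residual to every candidate eigenpair so spurious modes can be flagged. This is a simplified rendering with the state itself as the dictionary, not the paper's full algorithm:

        import numpy as np

        def dmd_with_residuals(X, Y):
            # X, Y: (m, n) snapshot matrices (rows are states, Y[t] = F(X[t])).
            K = np.linalg.lstsq(X, Y, rcond=None)[0]     # (n, n) Koopman estimate
            lam, G = np.linalg.eig(K)
            # Relative residual per eigenpair; large values flag spurious modes.
            res = np.array([np.linalg.norm(Y @ g - l * (X @ g)) / np.linalg.norm(X @ g)
                            for l, g in zip(lam, G.T)])
            return lam, G, res

        rng = np.random.default_rng(0)
        A = np.array([[0.9, -0.2], [0.2, 0.9]])          # known linear dynamics
        X = rng.normal(size=(200, 2))
        Y = X @ A.T + 0.01 * rng.normal(size=X.shape)    # noisy one-step pairs
        lam, G, res = dmd_with_residuals(X, Y)
        print(lam, res)                                  # residuals near zero here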
    Why GANs are overkill for NLP. (arXiv:2205.09838v1 [cs.LG])
    This work offers a novel theoretical perspective on why, despite numerous attempts, adversarial approaches to generative modeling (e.g., GANs) have not been as popular for certain generation tasks, particularly sequential tasks such as Natural Language Generation, as they have in others, such as Computer Vision. In particular, on sequential data such as text, maximum-likelihood approaches are significantly more utilized than GANs. We show that, while it may seem that maximizing likelihood is inherently different than minimizing distinguishability, this distinction is largely artificial and only holds for limited models. We argue that minimizing KL-divergence (i.e., maximizing likelihood) is a more efficient approach to effectively minimizing the same distinguishability criteria that adversarial models seek to optimize. Reductions show that minimizing distinguishability can be seen as simply boosting likelihood for certain families of models including n-gram models and neural networks with a softmax output layer. To achieve a full polynomial-time reduction, a novel next-token distinguishability model is considered.  ( 2 min )
    Discrete-Convex-Analysis-Based Framework for Warm-Starting Algorithms with Predictions. (arXiv:2205.09961v1 [cs.LG])
    Augmenting algorithms with learned predictions is a promising approach for going beyond worst-case bounds. Dinitz, Im, Lavastida, Moseley, and Vassilvitskii~(2021) have demonstrated that a warm start with learned dual solutions can improve the time complexity of the Hungarian method for weighted perfect bipartite matching. We extend and improve their framework in a principled manner via \textit{discrete convex analysis} (DCA), a discrete analog of convex analysis. We show the usefulness of our DCA-based framework by applying it to weighted perfect bipartite matching, weighted matroid intersection, and discrete energy minimization for computer vision. Our DCA-based framework yields time complexity bounds that depend on the $\ell_\infty$-distance from a predicted solution to an optimal solution, which has two advantages relative to the previous $\ell_1$-distance-dependent bounds: time complexity bounds are smaller, and learning of predictions is more sample efficient. We also discuss whether to learn primal or dual solutions from the DCA perspective.  ( 2 min )
    A Learning-Based Approach to Approximate Coded Computation. (arXiv:2205.09818v1 [cs.IT])
    Lagrange coded computation (LCC) is essential to solving problems about matrix polynomials in a coded distributed fashion; nevertheless, it can only solve the problems that are representable as matrix polynomials. In this paper, we propose AICC, an AI-aided learning approach that is inspired by LCC but also uses deep neural networks (DNNs). It is appropriate for coded computation of more general functions. Numerical simulations demonstrate the suitability of the proposed approach for the coded computation of different matrix functions that are often utilized in digital signal processing.  ( 2 min )
    Calibration Matters: Tackling Maximization Bias in Large-scale Advertising Recommendation Systems. (arXiv:2205.09809v1 [cs.LG])
    Calibration is defined as the ratio of the average predicted click rate to the true click rate. The optimization of calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ads auctions and the amount of money charged to advertisers. Despite its importance, calibration optimization often suffers from a problem called "maximization bias". Maximization bias refers to the phenomenon that the maximum of predicted values overestimates the true maximum. The problem is introduced because the calibration is computed on the set selected by the prediction model itself. It persists even if unbiased predictions can be achieved on every datapoint and worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we theorize the quantification of maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm in this paper. The algorithm is efficient, robust, and practical as it is able to mitigate maximization bias problems under covariate shifts, neither incurring additional online serving costs nor compromising the ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.  ( 2 min )
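    Maximization bias is easy to reproduce in a few lines: even when every per-item prediction is unbiased, the prediction attached to the selected (argmax) item overestimates its true value, because selection favors items with positive noise. A toy NumPy simulation with hypothetical CTR magnitudes:

        import numpy as np

        rng = np.random.default_rng(0)
        true_ctr = rng.uniform(0.01, 0.05, size=1000)         # 1000 candidate ads
        gaps = []
        for _ in range(2000):
            pred = true_ctr + rng.normal(0, 0.01, size=1000)  # unbiased noisy predictions
            winner = np.argmax(pred)                          # the ad the system selects
            gaps.append(pred[winner] - true_ctr[winner])
        print(f"mean overestimation on selected items: {np.mean(gaps):.4f}")  # > 0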
    Estimation of Entropy in Constant Space with Improved Sample Complexity. (arXiv:2205.09804v1 [cs.DS])
    Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution $\mathcal D$ over an alphabet of size $k$ up to $\pm\epsilon$ additive error by streaming over $(k/\epsilon^3) \cdot \text{polylog}(1/\epsilon)$ i.i.d. samples and using only $O(1)$ words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to $(k/\epsilon^2)\cdot \text{polylog}(1/\epsilon)$. We conjecture that this is optimal up to $\text{polylog}(1/\epsilon)$ factors.  ( 2 min )
    Causal Discovery and Injection for Feed-Forward Neural Networks. (arXiv:2205.09787v1 [cs.LG])
    Neural networks have proven to be effective at solving a wide range of problems but it is often unclear whether they learn any meaningful causal relationship: this poses a problem for the robustness of neural network models and their use for high-stakes decisions. We propose a novel method overcoming this issue by injecting knowledge in the form of (possibly partial) causal graphs into feed-forward neural networks, so that the learnt model is guaranteed to conform to the graph, hence adhering to expert knowledge. This knowledge may be given up-front or during the learning process, to improve the model through human-AI collaboration. We apply our method to synthetic and real (tabular) data showing that it is robust against noise and can improve causal discovery and prediction performance in low data regimes.  ( 2 min )
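    One simple way to hard-enforce conformity to a (possibly partial) causal graph is to mask forbidden connections in a layer's weight matrix, so an output can only depend on its permitted parents; a PyTorch sketch of that mechanism (the paper's actual injection procedure may differ in detail):

        import torch
        import torch.nn as nn

        class MaskedLinear(nn.Linear):
            # A linear layer whose connectivity is restricted by a fixed 0/1 mask.
            def __init__(self, in_features, out_features, mask):
                super().__init__(in_features, out_features)
                self.register_buffer("mask", mask)       # (out, in), 1 = edge allowed

            def forward(self, x):
                return nn.functional.linear(x, self.weight * self.mask, self.bias)

        # Output 0 may depend on inputs {0, 1}; output 1 only on input {2}.
        mask = torch.tensor([[1., 1., 0.], [0., 0., 1.]])
        layer = MaskedLinear(3, 2, mask)
        y = layer(torch.randn(4, 3))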
    Label-invariant Augmentation for Semi-Supervised Graph Classification. (arXiv:2205.09802v1 [cs.CV])
    Recently, contrastive-learning-based augmentation has surged to a new climax in the computer vision domain, where operations including rotation, crop, and flip, combined with dedicated algorithms, dramatically increase model generalization and robustness. Following this trend, some pioneering attempts apply a similar idea to graph data. Nevertheless, unlike images, it is much more difficult to design reasonable augmentations without changing the nature of graphs. Although promising, current graph contrastive learning does not achieve performance as strong as its visual counterpart. We conjecture that its current performance might be limited by the violation of the label-invariant augmentation assumption. In light of this, we propose a label-invariant augmentation for graph-structured data to address this challenge. Different from node/edge modification and subgraph extraction, we conduct the augmentation in the representation space and generate the augmented samples in the most difficult direction while keeping the labels of augmented data the same as the original samples. In the semi-supervised scenario, we demonstrate that our proposed method outperforms classical graph neural network based methods and recent graph contrastive learning on eight benchmark graph-structured datasets, followed by several in-depth experiments to further explore label-invariant augmentation in several aspects.  ( 2 min )
    Graph Neural Networks Are More Powerful Than we Think. (arXiv:2205.09801v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.  ( 2 min )
    HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding. (arXiv:2205.09753v1 [cs.AI])
    One essential task for autonomous driving is to encode the information of a driving scene into vector representations so that downstream tasks such as trajectory prediction can perform well. The driving scene is complicated: its elements are heterogeneous, carrying diverse types of information, i.e., agent dynamics, map routing, road lines, etc. Meanwhile, the elements are also relative to one another - they have spatial relations with each other, and such relations should be represented canonically in relative terms, since absolute coordinate values are meaningless. Taking these two observations into consideration, we propose a novel backbone, namely Heterogeneous Driving Graph Transformer (HDGT), which models the driving scene as a heterogeneous graph with different types of nodes and edges. For graph construction, each node represents either an agent or a road element, and each edge represents their semantic relation, such as Pedestrian-To-Crosswalk or Lane-To-Left-Lane. As for spatial relation encoding, instead of setting a fixed global reference, the coordinate information of each node as well as its in-edges is transformed into a local node-centric coordinate system. For the aggregation module in the graph neural network (GNN), we adopt the transformer structure in a hierarchical way to fit the heterogeneous nature of the inputs. Experimental results show that the proposed method achieves new state-of-the-art on the INTERACTION Prediction Challenge and the Waymo Open Motion Challenge, ranking 1st and 2nd respectively on the minADE/minFDE metric.  ( 2 min )
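    The node-centric encoding amounts to a rigid transform of world-frame points into each node's local frame; a minimal 2D NumPy version with an illustrative agent pose:

        import numpy as np

        def to_node_frame(points, pose):
            # Transform world-frame 2D points into the local frame of a node
            # with pose (x, y, heading), making edge features invariant to
            # global translation and rotation.
            x, y, th = pose
            R = np.array([[np.cos(th), np.sin(th)],
                          [-np.sin(th), np.cos(th)]])    # rotation by -heading
            return (points - np.array([x, y])) @ R.T

        agent_pose = (10.0, 5.0, np.pi / 2)              # agent at (10, 5), facing +y
        lane_pts = np.array([[10.0, 6.0], [10.0, 7.0]])  # points ahead of the agent
        print(to_node_frame(lane_pts, agent_pose))       # -> [[1, 0], [2, 0]]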
  • Open

    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v2 [cs.LG] UPDATED)
    Multivariate Time Series Forecasting focuses on the prediction of future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. This paper addresses these problems by translating multivariate forecasting into a spatiotemporal sequence formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning fully-connected spatiotemporal relationships purely from data.
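    The tokenization itself is just a flattening of the (time, variable) grid into one long sequence where every token carries its time stamp, variable id, and value; a minimal sketch, assuming the model embeds the three components downstream:

        import numpy as np

        def spatiotemporal_tokens(series, times):
            # series: (T, V) multivariate series -> (T*V, 3) tokens of
            # (time, variable id, value), so attention can relate any variable
            # at any timestep to any other.
            T, V = series.shape
            t_idx, v_idx = np.meshgrid(times, np.arange(V), indexing="ij")
            return np.stack([t_idx.ravel(), v_idx.ravel(), series.ravel()], axis=1)

        x = np.arange(6, dtype=float).reshape(3, 2)   # 3 timesteps, 2 variables
        print(spatiotemporal_tokens(x, times=np.arange(3)))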
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v2 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    Trend analysis and forecasting air pollution in Rwanda. (arXiv:2205.10024v1 [stat.ML])
    Air pollution is a major public health problem worldwide, and lack of data is a pervasive issue for most low- and middle-income countries. Ambient air pollution in the form of fine particulate matter (PM2.5) exceeds the World Health Organization guidelines in Rwanda, with a daily average of around 42.6 micrograms per cubic meter. Monitoring and mitigation strategies require an expensive investment in equipment to collect pollution data. Low-cost sensor technology and machine learning methods have emerged as an alternative solution for obtaining reliable information for decision making. This paper analyzes the trend of air pollution in Rwanda and proposes forecasting models suited to data collected by a network of low-cost sensors deployed in Rwanda.
    An alternative proof of the vulnerability of retrieval in high intrinsic dimensionality neighborhood. (arXiv:2010.00990v2 [cs.LG] UPDATED)
    This paper investigates the vulnerability of nearest neighbors search, which is a pivotal tool in data analysis and machine learning. The vulnerability is gauged as the relative amount of perturbation that an attacker needs to add to a dataset point in order to modify its neighbor rank w.r.t. a query. The statistical distribution of this quantity is derived from simple assumptions. Experiments on six large-scale datasets validate this model up to some outliers, which are explained in terms of violations of the assumptions.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v3 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Robust Expected Information Gain for Optimal Bayesian Experimental Design Using Ambiguity Sets. (arXiv:2205.09914v1 [stat.ML])
    The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze \emph{robust expected information gain} (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a `log-sum-exp' stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.
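    The "log-sum-exp" stabilization mirrors the standard dual of a KL-ball robust expectation: for a fixed temperature, the worst-case (soft-min) value over distributions near the sampling one is a log-sum-exp of the samples. A sketch of that generic identity with hypothetical EIG samples; the paper's exact REIG objective also involves the ambiguity-set radius, not shown here:

        import numpy as np

        def klball_softmin(samples, beta):
            # Dual form of min over Q with KL(Q||P) <= eps of E_Q[f] is
            #   sup over beta > 0 of  -beta * log E_P[exp(-f / beta)] - beta * eps;
            # for fixed beta this is a log-sum-exp over the samples.
            return -beta * np.log(np.mean(np.exp(-samples / beta)))

        rng = np.random.default_rng(0)
        eig_samples = rng.normal(1.0, 0.5, size=10_000)   # hypothetical EIG estimates
        print(np.mean(eig_samples), klball_softmin(eig_samples, beta=0.5))
        # the robust value sits below the plain mean, discounting optimistic samples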
    Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence. (arXiv:2204.11689v1 [physics.plasm-ph] CROSS LISTED)
    We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model are found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with an observed broadening in the distribution of turbulent field amplitudes and increased ${\bf E \times B}$ shearing rates.
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v1 [cs.LG])
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple \emph{post hoc} method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    A Case of Exponential Convergence Rates for SVM. (arXiv:2205.10055v1 [stat.ML])
    Classification is often the first problem described in introductory machine learning classes. Generalization guarantees for classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. Those "super" fast rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates, and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
    Triangulation candidates for Bayesian optimization. (arXiv:2112.07457v2 [stat.CO] UPDATED)
    Bayesian optimization involves "inner optimization" over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these "tricands" and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.
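    A minimal version of triangulation-based candidates takes the centroids of the Delaunay simplices of the current design (the paper's full "tricands" construction also adds boundary candidates):

        import numpy as np
        from scipy.spatial import Delaunay

        def tricand_centroids(X):
            # Candidates for the inner acquisition search: centroids of the
            # Delaunay triangulation of the existing inputs X.
            tri = Delaunay(X)
            return X[tri.simplices].mean(axis=1)     # (n_simplices, d)

        rng = np.random.default_rng(0)
        X = rng.uniform(size=(20, 2))                # existing BO design in [0, 1]^2
        print(tricand_centroids(X).shape)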
    Speeding up PCA with priming. (arXiv:2109.03709v3 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame.  ( 2 min )
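    The second (exact) step is cheap because it works inside the low-dimensional span produced by the priming run; a NumPy sketch in which a random matrix stands in for the output of Oja's rule or EigenGame:

        import numpy as np

        def primed_pca(X, Q_init, k):
            # Exact PCA restricted to span(Q_init): project, eigendecompose the
            # small covariance, and map the top-k directions back to R^n.
            Q, _ = np.linalg.qr(Q_init)          # orthonormalize the priming output
            Z = X @ Q                            # (samples, r), r small
            w, V = np.linalg.eigh(Z.T @ Z / (len(X) - 1))
            order = np.argsort(w)[::-1][:k]
            return Q @ V[:, order]               # refined principal directions

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 100)) @ rng.normal(size=(100, 100))
        Q_init = rng.normal(size=(100, 10))      # stand-in for an approximate-PCA result
        components = primed_pca(X - X.mean(0), Q_init, k=3)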
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v3 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information on consumers, effective personalization of offers, goods, and services has become a core business focus for companies to improve revenues and maintain competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with unknown data collection procedure. Importantly, in many business and medical settings, interpretability of a policy is essential. To address these challenges, we study the class of policies with linear decision boundaries and propose learning algorithms using tools from causal inference. We propose several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that an adapted Bayesian optimization algorithm is fast and effective. We test our algorithm with extensive simulation studies and apply it to an online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2\% in net sales revenue, while also providing informative insights on which features are important for the decision-making process, e.g. when "Attribute 2" is large, marginal increase in GMV is low for discounts higher than 10\%. Our findings suggest that the proposed policy learning algorithm provides a promising practical approach for interpretable personalization across a wide range of applications.  ( 2 min )
    Algorithms for Weak Optimal Transport with an Application to Economics. (arXiv:2205.09825v1 [stat.ML])
    The theory of weak optimal transport (WOT), introduced by [Gozlan et al., 2017], generalizes the classic Monge-Kantorovich framework by allowing the transport cost between one point and the points it is matched with to be nonlinear. In the so-called barycentric version of WOT, the cost for transporting a point $x$ only depends on $x$ and on the barycenter of the points it is matched with. This aggregation property of WOT is appealing in machine learning, economics and finance. Yet algorithms to compute WOT have only been developed for the special case of quadratic barycentric WOT, or depend on neural networks with no guarantee on the computed value and matching. The main difficulty lies in the transportation constraints, which are costly to project onto. In this paper, we propose to use mirror descent algorithms to solve the primal and dual versions of the WOT problem. We also apply our algorithms to the variant of WOT introduced by [Chon\'e et al., 2022] where mass is distributed from one space to another through unnormalized kernels (WOTUK). We empirically compare the solutions of WOT and WOTUK with classical OT. We illustrate our numerical methods on the economic framework of [Chon\'e and Kramarz, 2021], namely the matching between workers and firms on labor markets.
    Invariance principle of random projection for the norm. (arXiv:2112.00300v2 [math.PR] UPDATED)
    The Johnson-Lindenstrauss lemma guarantees that certain topological structure is preserved when high-dimensional deterministic vectors are randomly projected to low-dimensional vectors. In this work, we aim to understand how random projections affect the norms of random vectors. In particular, we prove that the distribution of the norm of a random vector $X \in \mathbb{R}^n$, whose entries are i.i.d. random variables, is preserved by a random projection $S:\mathbb{R}^n \to \mathbb{R}^m$. More precisely, \[ \frac{X^TS^TSX - mn}{\sqrt{\sigma^2 m^2n+2mn^2}} \xrightarrow[\quad m/n\to 0 \quad ]{ m,n\to \infty } \mathcal{N}(0,1). \] We also prove concentration of the random norm transformed by either a random projection or a random embedding. Overall, our results show that random matrices introduce low distortion to the norms of random vectors with i.i.d. entries.
    Nonlinear Initialization Methods for Low-Rank Neural Networks. (arXiv:2202.00834v3 [cs.LG] UPDATED)
    We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. The most successful prior existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is inspired by the insight that approximating the function corresponding to each layer is more important than approximating the parameter values. We provably demonstrate that there is a significant gap between these two approaches for ReLU networks, particularly as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter point holds when the network width is super-linear in dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem for fixed parameter rank $r$ -- previously, it was unknown that the problem was computationally tractable to solve even for rank $1$. We also provide a practical algorithm to solve this problem which is no more expensive than the existing spectral initialization approach, and validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
    Semi-self-supervised Automated ICD Coding. (arXiv:2205.10088v1 [cs.CL])
    Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a consultation of a patient. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect which is diminished when clinical features from the examination of the patient and diagnostics are made available. We recommend our method for augmenting scarce datasets for systems that take decisions based on clinical features that do not include examinations or tests.
    A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-Fit Covariance and Beyond. (arXiv:2205.10198v1 [math.ST])
    Estimation of the average treatment effect (ATE) is a central problem in causal inference. In recent times, inference for the ATE in the presence of high-dimensional covariates has been extensively studied. Among the diverse approaches that have been proposed, augmented inverse probability weighting (AIPW) with cross-fitting has emerged as a popular choice in practice. In this work, we study this cross-fit AIPW estimator under well-specified outcome regression and propensity score models in a high-dimensional regime where the number of features and samples are both large and comparable. Under assumptions on the covariate distribution, we establish a new CLT for the suitably scaled cross-fit AIPW that applies without any sparsity assumptions on the underlying high-dimensional parameters. Our CLT uncovers two crucial phenomena among others: (i) the AIPW exhibits a substantial variance inflation that can be precisely quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the pre-cross-fit estimates is non-negligible even on the root-n scale. In fact, these cross-covariances turn out to be negative in our setting. These findings are strikingly different from their classical counterparts. On the technical front, our work utilizes a novel interplay between three distinct tools--approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. We believe our proof techniques should be useful for analyzing other two-stage estimators in this high-dimensional regime. Finally, we complement our theoretical results with simulations that demonstrate both the finite sample efficacy of our CLT and its robustness to our assumptions.
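    For concreteness, a standard cross-fit AIPW estimate of the ATE, the object whose fluctuations the paper's CLT describes; the nuisance models here are simple parametric stand-ins:

        import numpy as np
        from sklearn.linear_model import LinearRegression, LogisticRegression
        from sklearn.model_selection import KFold

        def crossfit_aipw(X, A, Y, n_splits=2):
            # Nuisances are fit on one fold and evaluated on the held-out fold,
            # then the influence-function terms are averaged.
            psi = np.zeros(len(Y))
            for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
                e = LogisticRegression().fit(X[tr], A[tr]).predict_proba(X[te])[:, 1]
                m1 = LinearRegression().fit(X[tr][A[tr] == 1], Y[tr][A[tr] == 1])
                m0 = LinearRegression().fit(X[tr][A[tr] == 0], Y[tr][A[tr] == 0])
                mu1, mu0 = m1.predict(X[te]), m0.predict(X[te])
                psi[te] = (mu1 - mu0 + A[te] * (Y[te] - mu1) / e
                           - (1 - A[te]) * (Y[te] - mu0) / (1 - e))
            return psi.mean(), psi.std() / np.sqrt(len(Y))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 5))
        A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
        Y = 2.0 * A + X @ np.ones(5) + rng.normal(size=2000)   # true ATE = 2
        print(crossfit_aipw(X, A, Y))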
    Instagram Fake and Automated Account Detection. (arXiv:1910.03090v3 [cs.IR] UPDATED)
    Fake engagement is one of the significant problems in Online Social Networks (OSNs), used to increase the popularity of an account in an inorganic manner. The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, wrong product prediction systems, and an unhealthy social network environment. This study addresses the detection of fake and automated accounts, which lead to fake engagement on Instagram. Prior to this work, there was no publicly available dataset for fake and automated accounts, so two datasets have been published for this purpose. For the detection of these accounts, machine learning algorithms such as Naive Bayes, Logistic Regression, Support Vector Machines, and Neural Networks are applied. Additionally, for the detection of automated accounts, a cost-sensitive genetic algorithm is proposed to handle the unnatural bias in the dataset. To deal with the class imbalance in the fake-account dataset, the SMOTE-NC algorithm is applied. Classification accuracies of 86% and 96% are obtained on the automated and fake account detection datasets, respectively.
    Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning. (arXiv:2203.13319v2 [cs.LG] UPDATED)
    We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). {ReF-ER} was shown to outperform state of the art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and we ignore the effects of other actions on the transition map. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL, outperforms state of the art algorithms that rely on complex neural network architectures.
    On Calibration of Ensemble-Based Credal Predictors. (arXiv:2205.10082v1 [stat.ML])
    In recent years, several classification methods that intend to quantify epistemic uncertainty have been proposed, either by producing predictions in the form of second-order distributions or sets of probability distributions. In this work, we focus on the latter, also called credal predictors, and address the question of how to evaluate them: What does it mean that a credal predictor represents epistemic uncertainty in a faithful manner? To answer this question, we refer to the notion of calibration of probabilistic predictors and extend it to credal predictors. Broadly speaking, we call a credal predictor calibrated if it returns sets that cover the true conditional probability distribution. To verify this property for the important case of ensemble-based credal predictors, we propose a novel nonparametric calibration test that generalizes an existing test for probabilistic predictors to the case of credal predictors. Making use of this test, we empirically show that credal predictors based on deep neural networks are often not well calibrated.
    Conformal Prediction with Temporal Quantile Adjustments. (arXiv:2205.09940v1 [stat.ML])
    We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.
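    For reference, the plain split-conformal interval that TQA adjusts queries a single pre-specified quantile of calibration residuals, with no temporal adjustment; a minimal sketch:

        import numpy as np

        def split_conformal_interval(resid_cal, y_pred_test, alpha=0.1):
            # PI = point prediction +/- the (1 - alpha) empirical quantile of
            # absolute calibration residuals (finite-sample corrected).
            n = len(resid_cal)
            q = np.quantile(np.abs(resid_cal), np.ceil((1 - alpha) * (n + 1)) / n)
            return y_pred_test - q, y_pred_test + q

        rng = np.random.default_rng(0)
        resid_cal = rng.normal(0, 1, size=500)           # residuals on a calibration set
        lo, hi = split_conformal_interval(resid_cal, np.array([3.2, -1.0]))

    TQA's contribution is replacing the fixed level $1-\alpha$ with a quantile adjusted at each time $t$ and for each series.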
    Bounding the Effects of Continuous Treatments for Hidden Confounders. (arXiv:2204.11206v2 [stat.ME] UPDATED)
    Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders.
    Almost exact recovery in noisy semi-supervised learning. (arXiv:2007.14717v3 [cs.LG] UPDATED)
    Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
    Bootstrapping the error of Oja's algorithm. (arXiv:2106.14857v2 [math.ST] UPDATED)
    We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution. By combining classical tools from the U-statistics literature with recent results on high-dimensional central limit theorems for quadratic forms of random vectors and concentration of matrix products, we establish a weighted $\chi^2$ approximation result for the $\sin^2$ error between the population eigenvector and the output of Oja's algorithm. Since estimating the covariance matrix associated with the approximating distribution requires knowledge of unknown model parameters, we propose a multiplier bootstrap algorithm that may be updated in an online manner. We establish conditions under which the bootstrap distribution is close to the corresponding sampling distribution with high probability, thereby establishing the bootstrap as a consistent inferential method in an appropriate asymptotic regime.
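    For context, Oja's algorithm itself is a one-line streaming update whose output the proposed bootstrap quantifies; a minimal NumPy version with an illustrative step size:

        import numpy as np

        def oja(stream, eta=0.01):
            # Streaming estimate of the leading covariance eigenvector:
            # one gradient-like step and a renormalization per sample.
            w = None
            for x in stream:
                if w is None:
                    w = x / np.linalg.norm(x)
                w = w + eta * x * (x @ w)
                w /= np.linalg.norm(w)
            return w

        rng = np.random.default_rng(0)
        X = rng.multivariate_normal(np.zeros(3), np.diag([5.0, 1.0, 0.5]), size=20_000)
        print(np.abs(oja(iter(X))))          # concentrates on the first axis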
    A General Framework for quantifying Aleatoric and Epistemic uncertainty in Graph Neural Networks. (arXiv:2205.09968v1 [cs.LG])
    Graph Neural Networks (GNNs) provide a powerful framework that elegantly integrates graph theory with machine learning for the modeling and analysis of networked data. We consider the problem of quantifying the uncertainty in predictions of GNNs stemming from modeling errors and measurement uncertainty. We consider aleatoric uncertainty in the form of probabilistic links and noise in the feature vectors of nodes, while epistemic uncertainty is incorporated via a probability distribution over the model parameters. We propose a unified approach to treat both sources of uncertainty in a Bayesian framework, where Assumed Density Filtering is used to quantify aleatoric uncertainty and Monte Carlo dropout captures uncertainty in the model parameters. Finally, the two sources of uncertainty are aggregated to estimate the total uncertainty in the predictions of a GNN. Results on real-world datasets demonstrate that the Bayesian model performs on par with a frequentist model and provides additional information about prediction uncertainty that is sensitive to uncertainties in the data and model.
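    The Monte Carlo dropout component used for the epistemic part is model-agnostic and short; a generic PyTorch sketch (the paper pairs it with Assumed Density Filtering for the aleatoric part, which is not shown):

        import torch
        import torch.nn as nn

        def mc_dropout_predict(model, x, n_samples=50):
            # Keep dropout active at test time and read the spread of the
            # stochastic forward passes as epistemic uncertainty.
            model.train()                     # enables dropout layers
            with torch.no_grad():
                preds = torch.stack([model(x) for _ in range(n_samples)])
            return preds.mean(0), preds.std(0)

        net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                            nn.Dropout(0.2), nn.Linear(64, 1))
        mean, std = mc_dropout_predict(net, torch.randn(8, 16))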
    Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization. (arXiv:2205.10217v1 [stat.ML])
    The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(N)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.
    Generating Semantic Adversarial Examples via Feature Manipulation. (arXiv:2001.02297v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g. thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.  ( 2 min )
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v3 [stat.ML] UPDATED)
    Deep reinforcement learning (DRL) has attracted much attention as an approach to solve sequential decision making problems without mathematical models of systems or environments. In general, constraints may be imposed on the decision making. In this study, we consider the optimal decision making problems with constraints to complete temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which is called a $\tau$-CMDP. We formulate the STL-constrained optimal decision making problem as the $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.  ( 2 min )
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v3 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.  ( 2 min )
    EXODUS: Stable and Efficient Training of Spiking Neural Networks. (arXiv:2205.10242v1 [cs.NE])
    Spiking Neural Networks (SNNs) are gaining significant traction in machine learning tasks where energy-efficiency is of utmost importance. Training such networks using the state-of-the-art back-propagation through time (BPTT) is, however, very time-consuming. Previous work by Shrestha and Orchard [2018] employs an efficient GPU-accelerated back-propagation algorithm called SLAYER, which speeds up training considerably. SLAYER, however, does not take into account the neuron reset mechanism while computing the gradients, which we argue to be the source of numerical instability. To counteract this, SLAYER introduces a gradient scale hyperparameter across layers, which needs manual tuning. In this paper, (i) we modify SLAYER and design an algorithm called EXODUS, that accounts for the neuron reset mechanism and applies the Implicit Function Theorem (IFT) to calculate the correct gradients (equivalent to those computed by BPTT), (ii) we eliminate the need for ad-hoc scaling of gradients, thus, reducing the training complexity tremendously, (iii) we demonstrate, via computer simulations, that EXODUS is numerically stable and achieves a comparable or better performance than SLAYER especially in various tasks with SNNs that rely on temporal features. Our code is available at https://github.com/synsense/sinabs-exodus.  ( 2 min )
    B-cos Networks: Alignment is All We Need for Interpretability. (arXiv:2205.10268v1 [cs.CV])
    We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transforms in DNNs by our B-cos transform. As we show, a sequence (network) of such transforms induces a single linear transform that faithfully summarises the full model computations. Moreover, the B-cos transform introduces alignment pressure on the weights during optimisation. As a result, those induced linear transforms become highly interpretable and align with task-relevant features. Importantly, the B-cos transform is designed to be compatible with existing architectures and we show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets, whilst maintaining similar performance on ImageNet. The resulting explanations are of high visual quality and perform well under quantitative metrics for interpretability. Code available at https://www.github.com/moboehle/B-cos.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v3 [cs.LG] UPDATED)
    A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.
    Sparse Infinite Random Feature Latent Variable Modeling. (arXiv:2205.09909v1 [stat.ML])
    We propose a non-linear, Bayesian non-parametric latent variable model where the latent space is assumed to be sparse and infinite dimensional a priori using an Indian buffet process prior. A posteriori, the number of instantiated dimensions in the latent space is guaranteed to be finite. The purpose of placing the Indian buffet process on the latent variables is to: 1.) Automatically and probabilistically select the number of latent dimensions. 2.) Impose sparsity in the latent space, where the Indian buffet process will select which elements are exactly zero. Our proposed model allows for sparse, non-linear latent variable modeling where the number of latent dimensions is selected automatically. Inference is made tractable using the random Fourier approximation and we can easily implement posterior inference through Markov chain Monte Carlo sampling. This approach is amenable to many observation models beyond the Gaussian setting. We demonstrate the utility of our method on a variety of synthetic, biological and text datasets and show that we can obtain superior test set performance compared to previous latent variable models.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v3 [cs.LG] UPDATED)
    In lifelong learning systems based on artificial neural networks, one of the biggest obstacles is the inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this paper, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike networks of today, does not learn via the popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts synapses in a biologically-plausible fashion while another neural system learns to direct and control this cortex-like structure by mimicking some of the task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting compared to standard neural models, outperforming a swath of previously proposed methods, including rehearsal/data buffer-based methods, on both standard (SplitMNIST, Split Fashion MNIST, etc.) and custom benchmarks even though it is trained in a stream-like fashion. Our work offers evidence that emulating mechanisms in real neuronal systems, e.g., local learning, lateral competition, can yield new directions for tackling the grand challenge of lifelong machine learning.
    On Algorithmic Stability in Unsupervised Representation Learning. (arXiv:2106.05238v3 [cs.LG] UPDATED)
    In this paper, we investigate the algorithmic stability of unsupervised representation learning with deep generative models, as a function of repeated re-training on the same input data. Algorithms for learning low dimensional linear representations -- for example principal components analysis (PCA), or linear independent components analysis (ICA) -- come with guarantees that they will always reveal the same latent representations (perhaps up to an arbitrary rotation or permutation). Unfortunately, for non-linear representation learning, such as in a variational auto-encoder (VAE) model trained by stochastic gradient descent, we have no such guarantees. Recent work on identifiability in non-linear ICA has introduced a family of deep generative models that have identifiable latent representations, achieved by conditioning on side information (e.g. informative labels). We empirically evaluate the stability of these models under repeated re-estimation of parameters, and compare them to both standard VAEs and deep generative models which learn to cluster in their latent space. Surprisingly, we discover side information is not necessary for algorithmic stability: using standard quantitative measures of identifiability, we find deep generative models with latent clusterings are empirically identifiable to the same degree as models which rely on auxiliary labels. We relate these results to the possibility of identifiable non-linear ICA.
    Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores. (arXiv:2202.01748v2 [stat.ME] UPDATED)
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
    The Fairness of Credit Scoring Models. (arXiv:2205.10200v1 [stat.ML])
    In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g. gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call lack of fairness and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
    Counterfactual Temporal Point Processes. (arXiv:2111.07603v2 [cs.LG] UPDATED)
    Machine learning models based on temporal point processes are the state of the art in a wide variety of applications involving discrete events in continuous time. However, these models lack the ability to answer counterfactual questions, which are increasingly relevant as these models are being used to inform targeted interventions. In this work, our goal is to fill this gap. To this end, we first develop a causal model of thinning for temporal point processes that builds upon the Gumbel-Max structural causal model. This model satisfies a desirable counterfactual monotonicity condition, which is sufficient to identify counterfactual dynamics in the process of thinning. Then, given an observed realization of a temporal point process with a given intensity function, we develop a sampling algorithm that uses the above causal model of thinning and the superposition theorem to simulate counterfactual realizations of the temporal point process under a given alternative intensity function. Simulation experiments using synthetic and real epidemiological data show that the counterfactual realizations provided by our algorithm may give valuable insights to enhance targeted interventions.
    The Bayesian Context Trees State Space Model: Interpretable mixture models for time series. (arXiv:2106.03023v3 [stat.ME] UPDATED)
    A general hierarchical Bayesian framework is introduced for mixture modelling of real-valued time series, including a collection of effective tools for learning and inference. At the top level, a discrete context (or `state') is extracted for each sample, consisting of a discretised version of some of the most recent observations preceding it. The set of all relevant contexts are represented as a discrete context tree. At the bottom level, a different real-valued time series model is associated with each context (i.e., with each state). This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. We introduce algorithms that allow for efficient, exact Bayesian inference; in particular, the maximum a posteriori probability (MAP) model, including the relevant MAP context tree, can be identified exactly. These algorithms can be updated sequentially, facilitating efficient online forecasting. The utility of the general framework is illustrated in detail when autoregressive (AR) models are used at the bottom level, resulting in a nonlinear AR mixture model. Our methods are found to outperform several state-of-the-art techniques on both simulated and real-world data from economics and finance, both in terms of forecasting accuracy and computational requirements.
    Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome. (arXiv:2205.09906v1 [stat.ML])
    Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.
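    Those simplex-geometry operations are easy to get a feel for. Below is a plausible augmentation in that spirit, not necessarily the paper's exact strategies: mixing two compositions along an Aitchison geodesic (componentwise geometric interpolation, renormalized), with a pseudocount guarding against the zeros that are ubiquitous in microbiome data.

        import numpy as np

        def aitchison_mixup(x, y, lam, eps=1e-6):
            # Componentwise geometric interpolation x^lam * y^(1-lam),
            # renormalized to the simplex; eps guards against zeros.
            x, y = x + eps, y + eps
            z = np.exp(lam * np.log(x) + (1.0 - lam) * np.log(y))
            return z / z.sum()

        rng = np.random.default_rng(0)
        a, b = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
        augmented = aitchison_mixup(a, b, lam=rng.beta(2, 2))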
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v1 [cs.LG])
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime, which is restricted to a local region in the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{mei2018mean, chizat2018global}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated nonlinearity of the training dynamics. This work establishes a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel logarithmic Sobolev inequality for two-layer neural networks, and uniform upper bounds on the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.
    Graph Neural Networks Are More Powerful Than we Think. (arXiv:2205.09801v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.
    What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment. (arXiv:2205.10327v1 [stat.ME])
    The fundamental problem of causal inference -- that we never observe counterfactuals -- prevents us from identifying how many might be negatively affected by a proposed intervention. If, in an A/B test, half of users click (or buy, or watch, or renew, etc.), whether exposed to the standard experience A or a new one B, hypothetically it could be because the change affects no one, because the change positively affects half the user population to go from no-click to click while negatively affecting the other half, or something in between. While unknowable, this impact is clearly of material importance to the decision to implement a change or not, whether due to fairness, long-term, systemic, or operational considerations. We therefore derive the tightest-possible (i.e., sharp) bounds on the fraction negatively affected (and other related estimands) given data with only factual observations, whether experimental or observational. Naturally, the more we can stratify individuals by observable covariates, the tighter the sharp bounds. Since these bounds involve unknown functions that must be learned from data, we develop a robust inference algorithm that is efficient almost regardless of how and how fast these functions are learned, remains consistent when some are mislearned, and still gives valid conservative bounds when most are mislearned. Our methodology altogether therefore strongly supports credible conclusions: it avoids spuriously point-identifying this unknowable impact, focusing on the best bounds instead, and it permits exceedingly robust inference on these. We demonstrate our method in simulation studies and in a case study of career counseling for the unemployed.
    Causal Discovery and Injection for Feed-Forward Neural Networks. (arXiv:2205.09787v1 [cs.LG])
    Neural networks have proven to be effective at solving a wide range of problems but it is often unclear whether they learn any meaningful causal relationship: this poses a problem for the robustness of neural network models and their use for high-stakes decisions. We propose a novel method overcoming this issue by injecting knowledge in the form of (possibly partial) causal graphs into feed-forward neural networks, so that the learnt model is guaranteed to conform to the graph, hence adhering to expert knowledge. This knowledge may be given up-front or during the learning process, to improve the model through human-AI collaboration. We apply our method to synthetic and real (tabular) data showing that it is robust against noise and can improve causal discovery and prediction performance in low data regimes.
    Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits. (arXiv:2205.09899v1 [stat.ML])
    We prove an instance-independent (poly) logarithmic regret for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ is shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly) logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm-adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incur a regret that scales with $\polylog(T)$.  ( 2 min )

  • Open

    What do you think of this approach to help avoid overfitting? (Playing with episode length, start and end)
    Hi all, I'm working on applying RL to a time series environment. I have a limited (and relatively small, only ~30K rows) dataset for the agent to work through, and the data cannot be simulated. So basically, the agent takes a step, and the new observation/state is determined by the next row of the tabular dataset. I have reached a point of overfitting - in-sample performance of the RL agent continues increasing, but OOS performance is decreasing. I assume a big part of the reason for this is that the agent is somehow "memorizing" its training data. Here's the thing. At the moment, I am using ALL of the training data as one episode (so once it completes one episode having gone through ALL of the data, it restarts). What if I made an episode length equal to, say, 1/10th of my data and then started each episode at a random point in the time series? I am thinking that this way, I can have some kind of "randomness" in my otherwise deterministic environment, and perhaps this way I can force the agent to not "memorize" the training data? I am new to RL, so any feedback on this general idea would be greatly appreciated! EDIT: I have just tried this out, and while it's hard to tell (because of random fluctuations), using this technique DOES appear to have a positive effect on the out-of-sample performance! Again though, thoughts, ideas and feedback greatly appreciated. submitted by /u/VladimirB-98 [link] [comments]  ( 1 min )
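    For concreteness, here is a minimal sketch of that idea as a gym-style environment; `data`, the reward placeholder, and the class name are illustrative stand-ins for the poster's actual setup.

        import numpy as np

        class RandomWindowEnv:
            """Each episode covers a random window of the historical data
            instead of the whole series; reward/action logic are placeholders."""
            def __init__(self, data, episode_len):
                self.data = data              # (T, n_features) array
                self.episode_len = episode_len
                self.start = 0
                self.t = 0

            def reset(self):
                # Start each episode at a random point in the time series.
                self.start = np.random.randint(0, len(self.data) - self.episode_len)
                self.t = 0
                return self.data[self.start]

            def step(self, action):
                self.t += 1
                obs = self.data[self.start + self.t]
                reward = 0.0                  # placeholder: depends on action/obs
                done = self.t >= self.episode_len - 1
                return obs, reward, done, {}

        env = RandomWindowEnv(np.random.rand(30_000, 8), episode_len=3_000)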
    Need help with PettingZoo
    Hello everyone, I am working on a custom environment using PettingZoo for multi-agent reinforcement learning. I managed to finish the environment; however, now I don't know how to create the model. I also tried to create a model using stable_baselines3 PPO; however, I get the error "AttributeError: 'functools._lru_cache_wrapper' object has no attribute 'shape'". I tried to follow the example from here, but I cannot install SuperSuit with pip on Windows 10 and I am not sure if it is needed. Any help, be it examples or advice, will be very much appreciated. submitted by /u/Iltavil [link] [comments]  ( 1 min )
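    That particular AttributeError is often a symptom of code accessing `env.observation_space` as an attribute when the installed PettingZoo version exposes it as a cached method, i.e. a version mismatch between PettingZoo and the wrapper stack (stated here as a guess, not a diagnosis). The usual SB3 route does go through SuperSuit; a sketch, assuming a parallel-API env with identical spaces across agents (`my_parallel_env` is a hypothetical factory):

        import supersuit as ss
        from stable_baselines3 import PPO

        from my_project.envs import my_parallel_env  # hypothetical custom env

        env = my_parallel_env()
        # Flatten the multi-agent env into a vectorized single-agent env so
        # that all agents share one PPO policy (parameter sharing).
        env = ss.pettingzoo_env_to_vec_env_v1(env)
        env = ss.concat_vec_envs_v1(env, 4, base_class="stable_baselines3")

        model = PPO("MlpPolicy", env, verbose=1)
        model.learn(total_timesteps=100_000)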
    Can reinforcement learning be used for unlabeled time series data clustering?
    submitted by /u/Affectionate_Worth43 [link] [comments]  ( 1 min )
    How should one interpret these PPO diagnostic training plots?
    [image: four PPO diagnostic training plots] So here I have four diagnostic plots for PPO training on a Gym CustomEnv. I have many questions regarding how to interpret them, although if there is anything you think is interesting/insightful about these graphs, I would love to hear it. It might also be useful to know that the training was indeed successful for this run, and the mean episode reward was (more or less) consistently improving. 1) What does an increasing clip fraction indicate? 2) What does an increasing KL divergence indicate? 3) Why does the policy gradient loss go above 0? Wouldn't this mean that the policy should be getting worse? In this case the policy continues to improve even after getting this positive loss. 4) Same as question 3 but for entropy loss. Any help whatsoever will be great. I'm quite at a loss. Thanks. submitted by /u/C_BearHill [link] [comments]  ( 1 min )
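    On (1) and (2): both diagnostics measure how far each update moves the new policy from the rollout policy, so rising values usually mean the effective step size is growing (often tamed by lowering the learning rate or clip range). On (3), note that libraries such as Stable-Baselines3 log the negated clipped surrogate, so a positive value only means the surrogate was negative on that batch, not that the policy got worse. A sketch of how these two statistics are typically computed, assuming `log_prob`/`old_log_prob` tensors from a rollout buffer:

        import torch

        def ppo_diagnostics(log_prob, old_log_prob, clip_range=0.2):
            # Probability ratio r = pi_new(a|s) / pi_old(a|s).
            ratio = torch.exp(log_prob - old_log_prob)
            # Fraction of samples whose ratio left the clipping interval.
            clip_fraction = ((ratio - 1.0).abs() > clip_range).float().mean()
            # Low-variance estimator of the KL divergence ("approx_kl").
            approx_kl = ((ratio - 1.0) - (log_prob - old_log_prob)).mean()
            return clip_fraction.item(), approx_kl.item()

        log_prob = torch.randn(256) * 0.1
        old_log_prob = torch.randn(256) * 0.1
        print(ppo_diagnostics(log_prob, old_log_prob))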
  • Open

    [D] matrix profile distance measure characterization
    If there are various types of distance measures for time series, such as Euclidean, DTW, and shape-based ones, how can we characterize the matrix profile distance measure? A profiling one? submitted by /u/jiii95 [link] [comments]
    How to create a basic version of DALL-E from scratch? [P]
    I've been searching online for tutorials on how to create a text-to-image generator algorithm (a very basic version of DALL-E). I'd like to create the algorithm line by line rather than just copying some code from online & training it on my images. Basically, I'm trying to learn how these sorts of algorithms work by coding one from scratch. So, does anyone know of any video or online article tutorials for creating and training a basic text-to-image generator with Python? Thanks! submitted by /u/Special_Treacle4452 [link] [comments]  ( 1 min )
    [P] Gradio Demo for "Story and Video Generation" using GPT-J, Latent Diffusion, and FILM
    submitted by /u/Shikanomiya [link] [comments]
    This is how you can turn yourself into an immortal machine. [N]
    submitted by /u/Defiant_Swann [link] [comments]
    [r] How to train a neural network with a loss function based on cumulative AUC of multiple inputs
    Hello, I have almost no experience with neural networks, besides doing a few examples in books. Basically the problem is I have 1000 cases, of which 100 are true and 900 are false. Each case has a score associated with it, composed of multiple weighted sub-scores (Overall Score = W1 * Score 1 + W2 * Score 2 ... + WN * Score N). The cases are then sorted by their overall score and a ROC curve is generated by tracking the running percentage of positives identified vs. the percentage of negatives identified as you follow through the rank-ordered list of scores. I want to create a single set of sub-score weights which, when applied to all test cases, produces the maximal AUC. Any input is appreciated as I have no idea where to start. Thanks! submitted by /u/Fckcensorship25 [link] [comments]  ( 1 min )
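    Since AUC itself is non-differentiable, one common starting point is to optimize a smooth pairwise surrogate of it: AUC is the probability that a random true case outranks a random false case, so penalizing misordered (true, false) pairs maximizes it. A minimal PyTorch sketch, with synthetic data standing in for the real sub-scores:

        import torch

        def pairwise_auc_loss(overall, labels):
            # AUC = P(score_true > score_false); softplus penalizes
            # misordered (true, false) pairs, giving a smooth surrogate.
            pos = overall[labels == 1]
            neg = overall[labels == 0]
            diff = pos.unsqueeze(1) - neg.unsqueeze(0)  # all pos/neg pairs
            return torch.nn.functional.softplus(-diff).mean()

        n_cases, n_sub = 1000, 5
        scores = torch.randn(n_cases, n_sub)         # stand-in sub-scores
        labels = (torch.rand(n_cases) < 0.1).long()  # ~100 true cases

        w = torch.zeros(n_sub, requires_grad=True)   # the weights W1..WN
        opt = torch.optim.Adam([w], lr=0.05)
        for _ in range(200):
            loss = pairwise_auc_loss(scores @ w, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()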
    [D] Cheaper gpu cloud alternative to gradient (paperspace)
    Are there any cheaper subscription models like Gradient? Currently I pay 10 dollars a month with access to an RTX 5000 or P5000 for 6 hours at a time, after which you have to restart the instance, but the data is saved. Are there any better alternatives to this? Also, I'm a student in case there is a student discount. submitted by /u/Knightron2525 [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 138
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight; otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in the wiki. Preferably you should link the arxiv page (not the PDF; you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks: [index of links to earlier WAYR threads omitted]  ( 1 min )
    [P] PyTorch M1 GPU benchmark update including M1 Pro, M1 Max, and M1 Ultra after fixing the memory leak
    If someone is curious, I updated the benchmarks after the PyTorch team fixed the memory leak in the latest nightly release May 21->22. The results are quite improved: [chart: updated PyTorch M1 GPU benchmark results] For a more detailed write-up please see https://sebastianraschka.com/blog/2022/pytorch-m1-gpu.html submitted by /u/seraschka [link] [comments]  ( 1 min )
    [D] Generating fake research papers with GPT-3
    Hi! I wanted to generate fake research papers with GPT-3. I'm not sure how to do it in a cost-efficient way: should I fine-tune the model with custom examples? Since I am looking to generate the entire paper, I'm not sure how to maintain the context in different parts of the research paper (abstract, introduction, ....., conclusion). I was thinking of two paths: (1) Take an existing research paper. Feed it into GPT-3 to reword the entire research paper. (2) Take the title of an existing research paper. Make GPT-3 generate the abstract. Use the abstract to generate the introduction. Use the introduction to generate the description, and so on. I was planning to fine-tune the GPT-3 model by breaking down existing research papers into (prompt: title, completion: rest of the paper). Is there a better way to generate entire fake research papers? submitted by /u/mimeticaware [link] [comments]  ( 1 min )
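    For the fine-tuning path, the 2022-era GPT-3 fine-tuning API consumed JSONL records of prompt/completion pairs, which maps directly onto the (title, rest-of-paper) split described above; note that whole papers exceed the context window, so path (2), generating section by section with the previous section in the prompt, is usually more workable. A sketch with an invented paper title (the separator and stop-token conventions are illustrative, not required by the API):

        import json

        examples = [
            {
                # "\n\n###\n\n" is a common prompt/completion separator.
                "prompt": "Title: Deep Widgets for Frobnication\n\n###\n\n",
                # Completions start with a space; " END" acts as a stop token.
                "completion": " Abstract: We propose deep widgets ... END",
            },
        ]
        with open("papers.jsonl", "w") as f:
            for ex in examples:
                f.write(json.dumps(ex) + "\n")

        # Then, from the shell:
        #   openai api fine_tunes.create -t papers.jsonl -m davinci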
    [D] Explainable AI
    Does anyone know papers where experts are asked to interpret the results of explanation algorithms like SHAP and LIME? Ideally on time-series data, where the explanation algorithm highlights which time points led to the model's decisions. submitted by /u/Bananymous97 [link] [comments]  ( 1 min )
    [R] HYDRA is the first spatial perception engine that builds a 3D scene graph (geometry and semantics) from sensor data in realtime
    submitted by /u/SpatialComputing [link] [comments]  ( 2 min )
    [D] EMNLP 2022 and ARR June 1
    Hi there! We are planning to submit a paper to the June 1 ARR deadline and commit to EMNLP 2022 later. I wonder if reviews will be out before July 24, the ARR-to-EMNLP commitment deadline. Are you submitting to EMNLP directly or via ARR this year? Thanks in advance! Wishing you all the luck with paper submissions! submitted by /u/bunsenfeng [link] [comments]  ( 1 min )
    [D] which of these is a better ml strategy
    Using R here. Strategy 1: set a seed and divide the data into train and test only once; use repeated cross-validation on the training data to select the best model out of all the algorithms you're interested in; finally, apply the best-performing model to the test data and report the performance metrics. Strategy 2: divide the data into train and test, say, 10 times by changing the seed; train a model on each training set and report the performance metrics via the predict function on each test set; then use the test-set performance metrics to choose the model?? I am trying these strategies because I have two very limited sets of biological data with approximate n:p ratios of 1:7 and 2:7, and the performance on the test set changes drastically when changing the seed - R2 values vary from negative to 0.50. submitted by /u/triary95 [link] [comments]  ( 2 min )
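    The first strategy is the standard one: the test set is touched exactly once, whereas re-splitting by seed and picking the model with the best test metrics quietly turns the test sets into selection data. The post uses R/caret, but the logic is the same in any stack; a Python/scikit-learn sketch on synthetic data:

        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import (RepeatedKFold, cross_val_score,
                                             train_test_split)

        X, y = make_regression(n_samples=70, n_features=10, noise=10.0,
                               random_state=0)
        # One split, made once; the test set is not touched until the end.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

        cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
        candidates = {"ridge": Ridge(alpha=1.0),
                      "rf": RandomForestRegressor(n_estimators=200,
                                                  random_state=0)}

        # Model selection uses only the training portion.
        cv_means = {name: cross_val_score(m, X_tr, y_tr, cv=cv,
                                          scoring="r2").mean()
                    for name, m in candidates.items()}
        best_name = max(cv_means, key=cv_means.get)

        best = candidates[best_name].fit(X_tr, y_tr)
        print(best_name, "test R2:", best.score(X_te, y_te))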
    [D] GNN Architecture that inputs and outputs both edge and node features?
    I am looking for a GNN architecture/layer that takes in both the node and edge features and outputs both node and edge features too. The only one I could find is the "MetaLayer" in torch geometric from the paper "Relational inductive biases, deep learning, and graph networks" but that is from 2018. There must be something else and new out there. Any links to papers and/or implementations would be appreciated! submitted by /u/Hobo-Wizzard [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [R][P] A Python package for unsupervised mix data types clustering
    Problem statement: Clustering is an unsupervised machine learning approach for grouping unlabeled data. Unfortunately, most clustering methods can only operate on data that has either numeric or categorical properties. This is a significant issue since most real-world datasets contain multiple types of features. Solution: Today I am releasing the first version of the mixclu package for mixed-data clustering. Please check it out and give it a star if you like it. Mixclu (Mix Clustering) is an open-source Python package for doing unsupervised mixed-data-type clustering. This package puts together a variety of combination models, including kmeans-one-hot, Gower distance, umap, etc. The goal is to provide an easy-to-use implementation for each algorithm. The package is still in progress, and I plan to add deep-learning-based methods such as autoencoders and other techniques. I also plan to implement a few papers based on mixed clustering problem statements. Package link: https://github.com/monk1337/Mixclu Actively looking for contributions on the following tasks: Implement paper: Affinity Learning for Mixed Data Clustering (I do have Matlab code & data from the article's author; if anyone wants to contribute and convert the Matlab code to Python, please feel free to contact me). Implement paper: A Multi-View Clustering for Mixed Data (I do have Java code & data from the author of the article; please feel free to contact me to contribute). More features and suggestions are welcome :) submitted by /u/aadityaura [link] [comments]  ( 1 min )
    [Discussion] Name of a (possibly non-existent) paper stating vision-language pretraining can improve performance on text-only tasks?
    I'm trying to remember the name of a paper I think I read, though there's a chance this paper doesn't exist and I'm just mixing it up with something else. In the off-chance I'm remembering it correctly, I think it trained a vision-language model and showed that it outperformed a unimodal text-only model on language tasks? So the idea of the paper was vision-language pretraining doesn't just help in vision-language tasks, but actually improves performance on text-only tasks. If this paper is real, does anyone remember what it was called or how I can find it? submitted by /u/BurnerAccount1100 [link] [comments]  ( 1 min )
    [D] machine learning on sequential data with gaps by design
    hi, in my research we're doing binary classification on sequential data. some of the sequences have gaps, or start and end at different points, and this is intentional, since the values are normalized flux at different wavelengths and many points are missing due to absorption features. it is important to know where the missingness occurs in each sequence. essentially, the gaps are intentional, and this has created a lot of inconsistency and variance in the dataset, but the points can't be interpolated. i've kind of hit a wall after attempting to do manual feature extraction (how scattered the non-missing data are at different wavelengths, whether there is a large gap in the middle, etc.). how should i handle this missing-by-design data? is it possible to pad the sequences or fill in the gaps with -99? will a standard RNN still work? thank you in advance. submitted by /u/queenofthekyriarchy [link] [comments]  ( 1 min )
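    A standard RNN can work if the gaps are encoded explicitly rather than with a magic value like -99, which distorts the input scale: zero-fill the gaps and append a binary "observed" mask channel, so the network can tell real zeros from missing points (and where the missingness occurs, which matters here). A minimal PyTorch sketch with illustrative shapes:

        import torch
        import torch.nn as nn

        class MaskedSequenceClassifier(nn.Module):
            """Input channel 0: zero-filled flux; channel 1: observed mask."""
            def __init__(self, hidden=64):
                super().__init__()
                self.rnn = nn.GRU(input_size=2, hidden_size=hidden,
                                  batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, flux, observed):
                # flux, observed: (batch, seq_len); gaps have observed == 0.
                x = torch.stack([flux * observed, observed], dim=-1)
                _, h = self.rnn(x)
                return self.head(h[-1]).squeeze(-1)  # classification logit

        flux = torch.rand(8, 200)
        observed = (torch.rand(8, 200) > 0.3).float()  # 1 where measured
        logits = MaskedSequenceClassifier()(flux, observed)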
    [D] Clustering high dimensional
    I have a dataset containing multiple cancer cell lines (rows). The dataset features/columns are different genes, where each gene has a value based on their PDUI (polyadenylation site usage index) score (between 0 and 1). The dataset has 53 rows (cell lines) and 12,500 columns (genes). For each cell line, I also have an IC50 value (taken from a different dataset) which shows resistance/sensitivity to an anti-cancer drug being developed. I would like to cluster the cell lines considering all their genes' PDUI scores and then create a plot against IC50, to see if there are any relationships. A problem is that there are some genes where the majority of the cancer cell lines (some cases 99%) have missing-N/A values (as they do not have that gene). This leads to a large portion of the dataset being filled with N/A values. To deal with that I drop every column where more than 20% of the samples do not have a value. For the remaining columns (20% or less N/A values), I use a KNN imputer to fill out their missing values based on their 5 nearest neighbours. This results in approximately 6,000 columns/features remaining with no missing values. What algorithm do you think will perform best for clustering such a high-dimensional dataset? How would you approach such a task? I have of course tried to reduce the dimensions using PCA, reducing the dataset to just 50 features (whilst keeping a 98.8% explained variance ratio). I then tried using HDBSCAN but the algorithm failed to find any clusters. I am open to any suggestions and would greatly appreciate any discussion! submitted by /u/Rafaelkoll [link] [comments]  ( 5 min )
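    One caveat with this setup: with only 53 samples, HDBSCAN's defaults (min_cluster_size=5) are quite strict and will happily label everything as noise, so it is worth lowering them before concluding there are no clusters. A sketch of the described pipeline with that tweak, assuming the hdbscan package is installed and `pdui_scores.csv` is a hypothetical file with cell lines as rows and genes as columns:

        import numpy as np
        import pandas as pd
        from sklearn.impute import KNNImputer
        from sklearn.decomposition import PCA
        import hdbscan

        df = pd.read_csv("pdui_scores.csv", index_col=0)  # hypothetical file

        # Drop genes missing in >20% of cell lines, impute the rest (5-NN).
        df = df.loc[:, df.isna().mean() <= 0.20]
        X = KNNImputer(n_neighbors=5).fit_transform(df)

        X50 = PCA(n_components=50, random_state=0).fit_transform(X)
        labels = hdbscan.HDBSCAN(min_cluster_size=3,
                                 min_samples=1).fit_predict(X50)
        print(np.unique(labels, return_counts=True))  # label -1 = noise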
    [P] GPT-NeoX Playground
    We deployed the 20B-parameter GPT-NeoX model by #EleutherAI for anyone who wants to try it out (some features are disabled on mobile). GPT-NeoX is a 20-billion-parameter language model, open-sourced by EleutherAI. Link: https://neox.labml.ai/main Features I'd like to highlight: You can pick random sampling, nucleus sampling, greedy sampling, or beam search. Hover over tokens to see alternatives and the predicted probabilities. Outputs are streamed to the UI as they are received. You can find our simple annotated PyTorch re-implementation of GPT-NeoX here. We'd love to hear your feedback and suggestions. Thank you all, and I appreciate the support. submitted by /u/hnipun [link] [comments]  ( 1 min )
    [P] TF\Keras + GA: is there a "best" one?
    I use TF/Keras + "KerasGA (PyGAD)" for 10-30-parameter (climate) models, up to now with not much success. So I want to try other (Python-based) GA packages, like "Evolutionary Keras" and "Keras+DEAP"... My questions: Are there other powerful GA packages for Keras which I should try? Is there something like a "best" package for complex, "difficult" problems? submitted by /u/rudel_s [link] [comments]  ( 1 min )
    [D] Why is the skip-gram model used in DeepWalk and Node2vec rather than CBOW?
    I am curious to understand the reason behind the decision to use Skip-gram rather than CBOW for these two models. According to the original Word2vec paper, CBOW is faster to train and captures syntactic similarities better whereas the skip-gram is slower at training but captures more robust semantic similarities and is also better at handling infrequent words. How does this apply to graph theory and what motivated this decision? DeepWalk: https://dl.acm.org/doi/abs/10.1145/2623330.2623732 Node2vec: https://dl.acm.org/doi/abs/10.1145/2939672.2939754 submitted by /u/LanverYT [link] [comments]  ( 1 min )
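    A commonly cited intuition: node frequencies in random walks follow a power-law distribution, so most nodes behave like the infrequent words skip-gram handles better, and predicting context nodes from a target node matches the random-walk co-occurrence structure more naturally than averaging a context window as CBOW does. The mechanics in miniature, sketched with networkx and gensim (sg=1 selects skip-gram):

        import random
        import networkx as nx
        from gensim.models import Word2Vec

        G = nx.karate_club_graph()

        def random_walk(g, start, length=20):
            # Truncated random walk; nodes become string "words".
            walk = [start]
            for _ in range(length - 1):
                walk.append(random.choice(list(g.neighbors(walk[-1]))))
            return [str(n) for n in walk]

        # 10 walks per node form the "corpus"; sg=1 selects skip-gram.
        walks = [random_walk(G, n) for n in G.nodes() for _ in range(10)]
        model = Word2Vec(walks, vector_size=64, window=5, sg=1,
                         min_count=1, epochs=5)
        vec = model.wv["0"]  # embedding of node 0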
    [R] [P] Train dataset wih similar objects too close. Do I have to keep only one to detect with bounding boxes?
    Hi guys, I have training data with many similar objects close together in the scene, and the objects are quite wide, so if I remove the others from the training image keeping only one, the training image becomes too wide. I'm asking because I will use bounding boxes to recognize the objects for counting purposes. Can I keep only one object and fill the rest of the image with black to maintain a square aspect ratio? I will probably have to do some data augmentation too (rotating). Thanks! submitted by /u/MrWrodgy [link] [comments]  ( 2 min )
    [R] Happy to share our latest Research paper : MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
    I am pleased to share the news with the ML community that the team I work for recently released a new benchmark: MedMCQA, a large-scale Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. Our paper was accepted at the Conference on Health, Inference, and Learning (CHIL) 2022 and published in the Proceedings of Machine Learning Research (PMLR). [figure: MedMCQA sample questions] The main contributions: MedMCQA has more than 194k high-quality medical entrance exam MCQs. The dataset requires deeper domain and language understanding, as it tests 10+ reasoning abilities of a model across a wide range of medical subjects and topics. It covers 2.4k healthcare topics and 21 medical subjects with an average token length of 12.77, the …  ( 2 min )
  • Open

    Apple’s former ML director reportedly joins Google DeepMind
    submitted by /u/KendraMontgomery [link] [comments]
    Hack3: The Leading Online Hackathon for High Schoolers!
    Attention, curious high schoolers! Hack3 is hosting a 24-hour online hackathon for high schoolers on June 25-26. In 2020, we connected nearly 300 students of all skill levels to learn to build innovative projects that positively impacted the world. Over 100 attended our free classes led by industry professionals to learn new skills, and over twenty mentors staffed our help desk for participants who needed help. That year, our judges, mentors, and workshop instructors were affiliated with the likes of Stanford, Harvard, Amazon, NetApp, and Wikipedia. In 2021, we connected over 350 students of all skill levels; over 150 attended our free classes, and over 30 mentors staffed our help desk. Last year, our judges, mentors, and workshop instructors were affiliated with the likes of Amazon, NetApp, Balsamiq, Nexus Bytes, Replit, Postman, and Wolfram Language. This year, with the lessons learned from 2021, we aim to host a competition with over 500 participants while targeting underprivileged communities around the world. To help achieve our goal of providing a learning opportunity for everyone, we will be sponsoring internet access for those who need it, to truly level the playing field for all. Are you down? Register on DevPost here. submitted by /u/thegreatestgemini [link] [comments]  ( 1 min )
    General AI through scaling? Meta's AI chief Yann LeCun speaks out: "We have a number of obstacles to clear, and we don’t know how."
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    AI book reading tool
    Greetings folks, in the past year we released a book-reading AI tool to search for content within files using natural-language search, and we received constructive feedback from the community. Today we are releasing a new, updated version with a fresh UI overhaul (desktop support): https://rastero.io/intro Glad to share it with you all. Here's an account to try it out! username: reddit password: reddit2022 submitted by /u/deep_ak [link] [comments]  ( 1 min )
    10 Ways AI Will Change The World By 2050
    By 2050, AI will reach remarkable advancements that will be beyond many people's wildest dreams. Robots will not only be able to attain tasks but also generate them in a cost-effective, timely, and meticulous manner, hence increasing efficiency. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Should AI Be Democratized?
    AI is considered a “general-purpose technology” due to its pervasive role in various industries. Many researchers argue that AI should be employed only in high-impact domains, while others advocate democratizing AI so that the benefits are not restricted to a small group of people. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    AI Dream 53 - Cosmic Creation | MASTERPIECE TEASER
    submitted by /u/LordPewPew777 [link] [comments]
    Is there an AI which is able to combine 2 or more images?
    Or is there an AI which I can use to combine 2 images, like a dragon from one image and a background from another image, etc.? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    How to Control an AGI via Motivation Selection
    My Dear AI Fellows, Please check out my latest video about how to control an AGI via Motivation Selection: https://youtu.be/rLB4xkwgEAw I also have a lot of great content on the channel regarding life 3.0, building an AGI, AGI Safety, etc. Please check them out and subscribe to my channel! submitted by /u/billgggggg [link] [comments]
    Greg Brockman on Twitter - I also really like this one, created by the prompt "DALL-E dreaming of becoming an AGI"
    submitted by /u/mofosyne [link] [comments]  ( 1 min )
    Building an ACOG (part 1) (artificial cognitive entity)
    submitted by /u/DavidKShapiro [link] [comments]
    2029 AGI really tho?
    Tell me, is that far-fetched? I had an argument with a family member about when AGI arrives, and whether it even arrives at all. Is 2029 a reasonable date? I think so, but I need some backup. submitted by /u/Ashamed-Asparagus-93 [link] [comments]  ( 2 min )
    For games - A lot of Artificial Intelligences are just really deep exploration trees. Not learning anything intuitive like a human
    This has been bugging me for a bit. When it comes to Chess - Stockfish is powerful because it can search incredibly deep. It doesn't 'look' at a position and 'think' a move is good. Maybe AlphaZero is slightly different - as that has the deep search plus the evaluation? I can't remember exactly tbh. Poker AIs are just very deep brute-force approaches too, getting experience through tonnes of simulations. AlphaStar has to be different - this does learn general principles, as the games are far too different each time. AlphaGo - I don't think I have the knowledge to comment on this one, but from the Chess and Poker ones I would assume it's just a very deep search, though not 100% sure if anyone can clarify. It's been a while since I watched the doc / read about it. I also experimented with an NN that 'solves' tic-tac-toe. It was a hello-world-level tutorial. But it didn't do anything special - the neural network was just memorizing the optimal solution for every position. When I expanded the game from TTT to Connect 4, I realised it wasn't 'learning' anything from new positions. It was just memorizing every position - not impressive. Is this basically where we are at with AIs? They just have to play / simulate millions - trillions of times to outperform humans? submitted by /u/Cwlrs [link] [comments]  ( 1 min )
  • Open

    What strategies do you use to normalize data like distances to objectives?
    Hey, I am exploring neural networks and wondering how people handle unknown data ranges like distance. I'm not sure if I want to decide on an arbitrary maximum distance and normalize for that or if there's a standard way to deal with these kinds of things. One idea I had is to not use the actual distance but instead give a normalized value of the significance. If something is beyond a certain distance then it has an influence of 0 and if it's something like the creature is about to walk into a fire then the significance would be close to 1. Does that make sense? I've been seeing interesting evolution simulators (no training, occasional random changes and connections between neurons) and wanted to try my own. I made a creature with an x coordinate and known distance to a trap. I wasn't sure how to normalize the x coordinate or distance at large values so I thought of my 2nd idea there instead. Maybe that makes more sense for a sensory input anyway? Open to ideas or lexicon that will help me find an answer, thanks! submitted by /u/Tomnnn [link] [comments]  ( 1 min )
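    The significance idea in the post is sound and has standard closed forms: map an unbounded distance d >= 0 through a monotone squashing function so the input lives in [0, 1] without picking a hard maximum. Two common choices, sketched below; `scale` is a hypothetical tuning constant (roughly the distance at which the objective stops mattering):

        import numpy as np

        def significance_exp(d, scale):
            return np.exp(-d / scale)       # 1 when touching, ~0 when far

        def significance_inv(d, scale):
            return 1.0 / (1.0 + d / scale)  # heavier tail than the exponential

        for d in [0, 5, 50, 500]:
            print(d, significance_exp(d, 50.0), significance_inv(d, 50.0))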
    Overfitting in first epochs
    Hey everyone, I am training a CNN on the DVS gesture dataset using PyTorch. However, the training is not progressing smoothly: the training and validation accuracies fluctuate a lot; both are improving, but there is a big difference between them (5-6%, up to 10%), as if overfitting sets in by epoch 3 or 4. I have tried L2 regularization as well as dropout with high values; the difference disappears in the first iterations but reappears strongly afterward, and I am sure the datasets are properly merged and split randomly. PS: Could this be underfitting? How do I identify underfitting? Thanks in advance! submitted by /u/StartFinancial5917 [link] [comments]  ( 2 min )

  • Open

    Can I somehow enhance the quality of this image
    I dont know if this the right subreddit to post on but I want to enhance this image https://preview.redd.it/pq435gmd3x091.jpg?width=626&format=pjpg&auto=webp&s=5a01c6de7402dcb8d83e069cc5410a6c9c9d9017 submitted by /u/AdnanZXgamer [link] [comments]
    Without fail 😁🔥
    submitted by /u/p0goniphaft111 [link] [comments]
    Who is the Godfather of AI?
    How come Krishnamurti said in the year 1981 that 'the computer can do ANYTHING a human being can do' .. at a time when personal computers were very slow compared to today? This might be a longer rant about these questions; it offers a perspective where I suggest answers, as well as regaining control of a territory once lost: the mind. Let's start with the almost three-hour-long documentary by Adam Curtis called Hypernormalisation, in which he describes how governments no longer attempt to deal with the complexity of the real world, but only with the models created. What should have been a map describing the world assumed a primary role for actions and decisions. Reality becomes 'something you handle' and perception management of the masses became important. The process described by Ada…  ( 2 min )
    Can we train AI to capture the process of scientific evolution, and then fast-forward it to obtain future technology?
    Partly inspired by this article: https://www.quantamagazine.org/machine-scientists-distill-the-laws-of-physics-from-raw-data-20220510/, which describes AI that discovers new biology/physics equations from raw data. My question is: humans have come a long way from throwing stones to having all the technologies today. This thousands of years of evolution is a process in which new knowledge is developed from existing knowledge–countless cycles of observation, experimentation, and conclusion. Is it possible then, to train an AI to capture this process of generating new knowledge from existing knowledge, and use this AI to fast-forward scientific evolution, thus quickly obtaining future technology that would otherwise take decades to develop? submitted by /u/Independent_Ant_2027 [link] [comments]  ( 1 min )
    Pretty roses.
    submitted by /u/cookingandcraft [link] [comments]
    GATO outperformed experts on 450 tasks out of 604. A bold step towards an AGI
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
    What website can make an image like this from just a prompt?
    submitted by /u/Due-Ad9795 [link] [comments]
    How Uber uses AI to serve you better
    submitted by /u/OnlyProggingForFun [link] [comments]
    Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit
    submitted by /u/bartturner [link] [comments]
    A Look At 3 Big Risks With AI
    https://www.youtube.com/watch?v=0kEqqP8PlUw submitted by /u/kbf_ [link] [comments]
    Best Natural Language Processing Books for Beginners to read in 2022
    submitted by /u/maneesh123456 [link] [comments]
    I'm looking for AI image generators that make art based on the image(s) you provide
    Hi, instead of text-to-image AI generators, I'm wondering if there are any alternatives to The Looking Glass AI. The Looking Glass AI asks you for an image and generates images based on it and a theme input. I tried looking for alternatives myself, but I have only been able to come across text-to-image generators so far. submitted by /u/DTanya [link] [comments]  ( 1 min )
  • Open

    Research that may interest those working on the development of AGI (it identifies how consciousness/cognition evolved in living processes)
    A number of key challenges stand in the way of the development of HLAI and AGI. However, overcoming them has been handicapped by a lack of relevant progress in the human sciences: they have failed to produce clear understandings of how some key functions that are absent in current AI are enabled in humans. A paper just published in the journal BioSystems deals with some of these key issues: for example, it identifies at a cybernetic/systems level how consciousness, 'real-time' sensorimotor coordination, and mental modelling/cognition arose and developed in living organisms. These insights might prove useful in identifying how these functions could be instantiated and developed in AI. Titled 'The evolution and development of consciousness: the subject-object emergence hypothesis', the fu…  ( 2 min )
    What is the AlphaGo training time measured in human time?
    I was listening to a talk where it was mentioned that the AlphaGo training time in human time is 56 years, but there was no reference for that information. I'm wondering whether it's true, and how this was calculated. Any leads are appreciated, thanks! submitted by /u/exenson [link] [comments]  ( 1 min )
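    No authoritative source for the 56-years figure comes to mind, but estimates like it are usually produced by simple arithmetic over assumed rates. A sketch, where every number is an explicit placeholder rather than anything from the talk:

    ```python
    games_played_in_training = 5_000_000   # assumed self-play game count
    human_games_per_day = 10               # assumed pace of a dedicated human
    years = games_played_in_training / (human_games_per_day * 365)
    print(f"{years:,.0f} human-years of play")  # ~1,370 with these guesses
    ```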
    Is there a way to sample a constant action for the whole episode?
    Hi, Usually actions are sampled for each step and applied to the agent. However, is there a way to sample an action that remains constant until the episode ends? I am using Isaac Gym. submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
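    Not Isaac Gym-specific, but one generic way to get this behavior is a wrapper that freezes the first action of each episode and replays it at every subsequent step; a minimal sketch assuming a standard Gym-style API:

    ```python
    import gym

    class ConstantActionEpisode(gym.Wrapper):
        """Applies the first action of each episode at every step until done."""

        def reset(self, **kwargs):
            self._episode_action = None  # fixed on the first step() call
            return self.env.reset(**kwargs)

        def step(self, action):
            if self._episode_action is None:
                self._episode_action = action  # freeze the first sampled action
            return self.env.step(self._episode_action)
    ```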
    PPO learns how to play browser-based game
    Hi everyone, For the past few years I've had a JavaScript game I wrote as a fun side project. Now that I'm studying reinforcement learning, I wanted to see if I could code PPO into the game so that it could train *directly in the browser* without needing Python or any other dependencies... Well, I couldn't find any implementations of PPO in JavaScript, so I decided to code it into the game myself. I'm really happy with the end result - I've even included a pretrained agent that can autoplay the game live! (it beats the game ~85% of the time 😅) Source code: https://github.com/hmomin/ppo-winter-run Game: https://winter-run.com Hope you enjoy! I'm wondering if you all have any interest in a PPO implementation for JavaScript/TypeScript? If so, I may rework the source code a bit to play more easily with gym-like environments and release it as a separate project in a future post... submitted by /u/hmomin [link] [comments]  ( 1 min )
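    For anyone curious what the core of such a port involves: the heart of PPO is the clipped surrogate loss. A minimal NumPy sketch (no GAE, networks, or optimizer shown):

    ```python
    import numpy as np

    def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
        ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return -np.mean(np.minimum(unclipped, clipped))
    ```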
    Recommendation system with limited data?
    Hi! I’m attempting to build a recommendation system using reinforcement learning for a side project. I was wondering if there are algorithms that work well with limited data? Thanks in advance! Edit: also wondering if there are any systems that are trained by reaching a goal (beating an opponent in games) and are used in a different field (not gaming). submitted by /u/brioche789 [link] [comments]  ( 1 min )
    12 Best Courses to Learn Deep Learning
    submitted by /u/MlTut [link] [comments]
  • Open

    [R][P] Gradio Web demo for StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators (SIGGRAPH 2022)
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Deepfakes - How Long Do They Take to Make? [D]
    How long do deepfakes take to make? Example scenario: I make a 1-minute video where I'm talking to the camera, and I want to swap out my face for someone else's. I want it to be realistic (so people wouldn't be able to tell it's not my face). Questions: What's the best way to achieve this? How long does it take? What's the easiest way to achieve this? Is DeepFaceLab the best option? submitted by /u/AviatorPrints [link] [comments]  ( 1 min )
    [N] Surgical Tracking Challenge
    We, at the Hamlyn Centre - Imperial College London, will be hosting a first-of-its-kind Tracking Challenge for Surgery at MICCAI 2022. If you are interested, please visit: https://surgt.grand-challenge.org submitted by /u/aweld20 [link] [comments]
    [D] Looking for examples of failures of correlation-based NLP
    I am looking for examples (e.g. screenshots of dialogues) of using recent NLP models and getting unintuitive results because the model infers a wrong causal relationship between two words that often occur together. submitted by /u/dr_cosmicomical [link] [comments]
    [D] Character-level vs. word-level tokenization
    Hi all, I'm relatively new to the field of NLP, and while reading a 2015 blog post, The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy, I was wondering about this part of the "Further Reading" section: Currently it seems that word-level models work better than character-level models, but this is surely a temporary thing. Aren't most state-of-the-art models these days using some kind of vocabulary, i.e. whole words or at least sub-words? Text in the wild can be full of typos, emojis or other Unicode craziness, so wouldn't training all these LLMs on a character level make them more flexible and better applicable to real-life problems? I'd love to hear your opinions about this and to be pointed towards good resources to learn more about different tokenization methods and their limitations and performance implications. Cheers. submitted by /u/CodeAllDay1337 [link] [comments]  ( 2 min )
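    A toy comparison makes the trade-off concrete: character-level tokenization never goes out of vocabulary, but sequences get much longer, which is one reason practical systems settle on subwords in between.

    ```python
    text = "unbelievably goooood 🤖"

    char_tokens = list(text)    # 22 tokens; nothing is ever out-of-vocabulary
    word_tokens = text.split()  # 3 tokens; "goooood" and the emoji would be
                                # <UNK> in any fixed word vocabulary
    print(len(char_tokens), len(word_tokens))
    ```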
    [Project] How to create visualizations for "complex" networks?
    For a presentation I need to make my own visualization of a graph neural network with encoders and other additional parts. I would like to get something like this: [image: example architecture diagram] Does anyone know a convenient way to do this? submitted by /u/mr_birrd [link] [comments]  ( 2 min )
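    One convenient option is the graphviz Python package, which lays out block diagrams automatically; a sketch (the block names are just placeholders for the poster's architecture):

    ```python
    from graphviz import Digraph

    g = Digraph("gnn", graph_attr={"rankdir": "LR"})  # left-to-right layout
    g.node("x", "Input graph")
    g.node("enc", "Encoder MLP", shape="box")
    g.node("mp", "Message passing x L", shape="box")
    g.node("dec", "Decoder", shape="box")
    g.edges([("x", "enc"), ("enc", "mp"), ("mp", "dec")])
    g.render("gnn_architecture", format="png")  # needs the Graphviz binaries
    ```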
    [N] Introducing PeerXiv - A modern platform for peer-review of preprints
    (Check out the Twitter thread) What would a peer review process look like if it was designed today? Peer review is one of the cornerstones of the research community, and yet while our community keeps advancing and growing, the reviewing process remains almost unchanged. We strongly believe that peer review can be so much better for both authors and reviewers, and we are excited to share PeerXiv, our proposal to do just that. Check out the PeerXiv mock: https://peerxiv.web.app/about PeerXiv is a modern platform for the peer review of preprints. Authors can submit their preprints and get feedback directly from a set of anonymous PeerXiv reviewers, who earn reputation points for their effort 📈  ( 2 min )
  • Open

    Decoding a grid square
    I saw a reference last night to the grid square EL29fx and wanted to figure out where that is. There are many programs that will do this for you, but I wanted to do it by hand. I wrote about how grid squares work a year ago, but I was rusty on the details, so […] Decoding a grid square first appeared on John D. Cook.  ( 2 min )
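    The post is truncated here, but grid squares follow the Maidenhead locator system, which is straightforward to decode by hand; a sketch:

    ```python
    def decode_maidenhead(locator):
        """Southwest corner (longitude, latitude) of a Maidenhead grid square."""
        loc = locator.strip()
        # Field: letters A-R, 20 degrees of longitude by 10 degrees of latitude.
        lon = (ord(loc[0].upper()) - ord("A")) * 20 - 180
        lat = (ord(loc[1].upper()) - ord("A")) * 10 - 90
        # Square: digits 0-9, 2 degrees by 1 degree.
        lon += int(loc[2]) * 2
        lat += int(loc[3])
        # Subsquare: letters a-x, 5 minutes by 2.5 minutes.
        if len(loc) >= 6:
            lon += (ord(loc[4].lower()) - ord("a")) * 5 / 60
            lat += (ord(loc[5].lower()) - ord("a")) * 2.5 / 60
        return lon, lat

    print(decode_maidenhead("EL29fx"))  # about (-95.58, 29.96): Houston, Texas
    ```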
    Exponential of a line
    Draw a line in the complex plane. What is the image of that line when you apply the exponential function? A line through w with direction z is the set of points w + tz where w and z are complex and t ranges over the real numbers. The image of this line is exp(w+ […] Exponential of a line first appeared on John D. Cook.  ( 1 min )
    Discrete derivatives
    If you’ve taken calculus, and someone asks you what the derivative of x^5 is, you can say without hesitation that it’s 5x^4. Now suppose they come back and say, “I’m sorry. I forgot to give you any context. Here x^5 is a polynomial in the field of 343 elements.” It turns out that this additional […] Discrete derivatives first appeared on John D. Cook.  ( 2 min )
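    The teaser is truncated, but presumably the point is that a formal derivative over a finite field reduces coefficients modulo the characteristic (7, for the field of 343 = 7^3 elements), so some familiar answers survive and others vanish; a sketch:

    ```python
    def formal_derivative(coeffs, p):
        """coeffs[i] is the coefficient of x^i; arithmetic is modulo p."""
        return [(i * c) % p for i, c in enumerate(coeffs)][1:]

    print(formal_derivative([0, 0, 0, 0, 0, 1], 7))  # x^5 -> 5x^4, as in calculus
    print(formal_derivative([0] * 7 + [1], 7))       # x^7 -> 7x^6 = 0 mod 7
    ```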
  • Open

    Monkey Patching Python Code
    Python is a dynamic scripting language. Not only does it have a dynamic type system where a variable can be assigned to one type first and changed later, but its object model is also dynamic. This allows us to modify its behavior at run time. A consequence of this is the possibility of monkey patching. […] The post Monkey Patching Python Code appeared first on Machine Learning Mastery.  ( 13 min )
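    The post itself is behind the link, but the technique is simple to illustrate: reassign an attribute of a class at run time to change behavior everywhere. A minimal sketch:

    ```python
    import datetime

    class Greeter:
        def greet(self, name):
            return f"Hello, {name}!"

    _original_greet = Greeter.greet  # keep a handle on the original

    def greet_with_logging(self, name):
        print(f"[{datetime.datetime.now()}] greet({name!r})")
        return _original_greet(self, name)

    Greeter.greet = greet_with_logging  # the patch: all instances are affected
    print(Greeter().greet("world"))
    ```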
  • Open

    Unleash Your Dragon (Remastered 8K 60FPS)
    submitted by /u/stepanmetior [link] [comments]
  • Open

    Keeping Your Company’s Data Model IP
    Robotic process automation (RPA) disappointed me when it first gained attention five or so years ago. Here was yet another stopgap measure, a digital form of tape and baling wire to temporarily join scripts of oft-repeated tasks in a specified workflow sequence across commonly used applications. RPA seemed quite brittle: wouldn't you have to regenerate the… Read More »Keeping Your Company’s Data Model IP The post Keeping Your Company’s Data Model IP appeared first on Data Science Central.  ( 5 min )

  • Open

    [R] A paper that contributes to the development of AGI by identifying how consciousness/cognition evolved in living processes
    A number of key challenges stand in the way of the development of HLAI and AGI. However, overcoming them has been handicapped by a lack of relevant progress in the human sciences: they have failed to produce clear understandings of how some key functions that are absent in current AI are enabled in humans. A paper just published in the journal BioSystems deals with some of these key issues: for example, it identifies at a cybernetic/systems level how consciousness, 'real-time' sensorimotor coordination, and mental modelling/cognition arose and developed in living organisms. These insights might prove useful in identifying how these functions could be instantiated and developed in AI. Titled 'The evolution and development of consciousness: the subject-object emergence hypothesis', the fu…  ( 2 min )
    [N] Introducing NGC-Learn: Predictive Coding and Neurobiologically-Motivated Learning in Python
    Interested in doing research in neurobiologically-inspired artificial neural networks? Need an open-source, actively maintained tool for reproducing the latest paper on predictive coding or building your own more biologically-faithful neural system? ngc-learn is a recently-released Python library designed in response to these questions. The ngc-learn dynamics simulator is specifically meant for building, simulating, and analyzing arbitrary predictive coding models based on the neural generative coding (NGC) computational framework and theoretically guided by the free energy principle. This toolkit, distributed under the 3-Clause BSD license, is built on top of Tensorflow 2. Notably, ngc-learn's extensible nodes-and-cables system is general and can even be used to build non-predictive cod…  ( 1 min )
    [D] How would one even maintain a generalist agent like Gato?
    DeepMind just published a paper on Gato, a generalist agent that can perform more than 600 tasks. I get that there's still a huge debate/discussion about AGI, HLAI, ethical concerns, etc., so such an agent is unlikely to be deployed in production soon, but let's just entertain the idea for a second: how would one even maintain a model like that? If it performs well for all tasks except one, how would you retrain it only for the one task on which it underperformed? The model uses the same weights and biases for all the tasks, so even if you choose to freeze certain nodes, how would you even go about doing that? submitted by /u/Lexayne [link] [comments]  ( 1 min )
    "[Project]" Brainchop: In-browser deep learning framework for volumetric Segmentation
    ​ https://preview.redd.it/udqgrwbzlo091.png?width=570&format=png&auto=webp&s=d856ab9334e6a0c5ae052c762fdfb8d6e22cfb61 Live Demo: brainchop.org Brainchop is a client-side web-application for automatic segmentation of MRI volumes that brings automatic volumetric segmentation capability to neuroimaging by running a robustly pre-trained deep learning model. The app does not require technical sophistication from the user and is designed for locally and privately segmenting user’s T1 volumes. Results of the segmentation may be easily saved locally after the computation. An intuitive interactive interface that does not require any special training nor specific instruction to run enables access to a state of the art deep learning brain segmentation for anyone with a modern browser (e.g. Firefox, Chrome etc) and commonly available hardware. Additionally, we make implementation of brainchop freely available releasing its pure Javascript code as open-source. https://preview.redd.it/q9q0zq3wlo091.png?width=3333&format=png&auto=webp&s=0110de38b24890373f4640166c53de508969ded0 submitted by /u/Character-Rip-5824 [link] [comments]  ( 1 min )
    What libs/boiler plate/platforms do you use to abstract and optimize your workflow when starting a new project? [D]
    E.g.: new data project, currently investigating the data. I'm joining multiple data sources to create a standard data object represented by X number of features. Now I want to get a sense of the distribution and skew of each feature (basic stat analysis), do some elementary clustering, and be able to repeat the process as I add, change or remove features. I can write abstractions in code to do these things, such as a function that plots the frequency of values for a specific feature and gives the mean, mode, median, etc., but I'm wondering if there are libraries or platforms you all use to avoid this tediousness. I understand that may sacrifice the customizable nature of writing your own code; however, I'm interested in such things to get a quick sense of the metadata. submitted by /u/gravbeamemitter [link] [comments]  ( 1 min )
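    Short of a full profiling library, a few pandas built-ins already cover much of what the post describes; a sketch (the helper's name is illustrative):

    ```python
    import pandas as pd

    def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
        summary = df.describe(include="all").T  # counts, mean, std, quartiles
        summary["skew"] = df.skew(numeric_only=True)
        summary["n_missing"] = df.isna().sum()
        return summary

    # df.hist(bins=30) then draws a histogram per numeric column in one call.
    ```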
    [P] SSO (Single Sign-On) for CVAT, the computer vision annotation tool
    For those who are interested in using CVAT with SSO, previously I made a proof-of-concept video to demonstrate my SSO implementation for CVAT: https://www.youtube.com/watch?v=R7hBBLG5Fdc Now I'm happy to announce that I have submitted my code changes: https://github.com/AlexGaoDW/cvat/tree/feature/datawiza-sso And I've created a PR to get it into the official repo. You can try it out by yourself following the document here: https://docs.datawiza.com/guides/cvat.html I also set up an instance using Google as the identity provider such that you can try SSO functionality with your Google account: https://cvat-sso.datawiza.net/ Enjoy! submitted by /u/alexcgg1 [link] [comments]  ( 1 min )
    [R] Proof Of Useful Work For AI Training On The Blockchain
    submitted by /u/EducationalCicada [link] [comments]  ( 2 min )
    [D] IEEE Access Article accepted as the first author, what's next?
    Dear members, This is my very first experience submitting to IEEE Access as a first author. My article was accepted after I submitted minor edits with a rebuttal. However, in my acceptance email, the reviewer asks me to incorporate his/her comments, including revamping the sections, adding elements to figures, giving more explanations, and adding references. The same email also states that "you are not permitted to add or remove authors or references post-acceptance" and "to take this opportunity to improve the English grammar and check spelling, as the article is only lightly edited before publication". This has become a very confusing situation for me. If anyone has experience publishing in this journal, I would appreciate your guidance. submitted by /u/HQ2020 [link] [comments]  ( 2 min )
    [P] Add conditional node to random forest?
    Hi all, I should say up front that I am pretty new to ML, so pardon me if I say anything wrong. I am wondering if it would be possible to do the following: a modified semi-supervised RandomForest, with a conditional node added as a positive branch to all decision trees, which controls for certain features? Basically I would like the algorithm to check the quantitative level of a feature (dependent variable) in samples positive for the said feature (independent variable). The algorithm has to set a cutoff for the quantitative variable only if the categorical variable is TRUE (which is the case when the quantitative variable is != 0). Does it make sense? How can I add such a conditional statement? I mainly operate in R using randomForest, but I am also a Python user, so I'm open to trying scikit-learn. Thanks submitted by /u/ElMochiKris [link] [comments]  ( 1 min )
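    One way to approximate this without modifying the trees themselves is feature engineering: add an interaction column that carries the quantitative level only when the categorical flag is TRUE, so the forest can place its cutoff on that column directly. A sketch with made-up arrays standing in for the real data:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data: `quant` is the quantitative level, `flag` the TRUE/FALSE
    # feature, `y` the labels. Replace with the real columns.
    rng = np.random.default_rng(0)
    quant = rng.gamma(2.0, 1.0, size=200)
    flag = rng.integers(0, 2, size=200).astype(bool)
    y = rng.integers(0, 2, size=200)

    X = np.column_stack([quant, flag.astype(int), quant * flag])
    clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    ```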
    [D] Mining restaurant reviews for vibe
    tl;dr: I recently procured 1M+ restaurant reviews. Now I'd like to see if I can mine any hidden and interesting information from them to build a search system. Looking for advice. If you were doing this project - "determine the vibe of a restaurant based on the reviews; allow a user to search based on expected loudness at the venue" - how would you approach it? Boring text about what I've tried: So far, I've tried using top2vec on different granularities of the reviews. For example: (1) each document is all reviews for a restaurant concatenated together; (2) each document is a complete review; (3) each document is a paragraph; (4) each document is a single sentence. If I search for topics, I can certainly find some interesting ones. Here are the top 10 from the "complete review" model, for example: …  ( 3 min )
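    One simple baseline for the loudness search, besides topic modelling: embed reviews and a free-text query with sentence-transformers and rank by cosine similarity. A sketch (the model name is just a common default, not the poster's setup):

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    reviews = ["So loud we couldn't hear each other.",
               "Quiet, candlelit, perfect for a date."]
    query = "a loud, lively venue"

    scores = util.cos_sim(model.encode(query), model.encode(reviews))
    print(scores)  # rank restaurants by how "loud" their reviews sound
    ```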
    [P] Extract relevant text from error logs
    I'm trying to extract relevant text from error logs (example below), then classify them by error reason. I've been doing it (but mostly failing) using the following process: (1) regex to filter out text between brackets and remove dates, times, IPs, filepaths, etc.; (2) remove stop words and keep only real words; (3) put the resulting string of words through a sentence transformer (all-mpnet-base-v2 in this case) to get embeddings; (4) reduce dimensionality and cluster; (5) within clusters, use the most frequent words (up to 5) as a label for the "error reason". This process is really brittle because of the regex and because there is so much variability in log structures. Furthermore, the "error reason" labels are usually nonsensical because the process doesn't properly extract the relevant text. For the example below, I get "failed_run_coverage_statements_panic", when I'd hope to get something relating to timing out. Does anyone have any ideas of how I might approach this? Example error log: Failed === RUN GoJobSubscriber {"level":"info","ts":1649273105.2566006,"caller":"rabbitmq/connection.go:20","msg":"client successfully connected to RMQ"} {"level":"info","ts":1649273105.2577667,"caller":"rabbitmq/connection.go:28","msg":"client successfully created channel"} {"level":"info","ts":1649273105.258319,"caller":"rabbitmq/connection.go:34","msg":"client successfully configured QOS"} {"level":"info","ts":1649273105.2584598,"caller":"rabbitmq/connection.go:45","msg":"client successfully assigned dependencies"} {"level":"info","ts":1649273105.2894936,"caller":"rabbitmq/connection.go:52","msg":"client successfully ensured topology"} {"level":"info","ts":1649273105.2906847,"caller":"rabbitmq/subscription.go:46","msg":"Starting event subscriber...."} coverage: 55.8% of statements panic: test timed out after 2m0s goroutine 11 [running]: testing.(*M).startAlarm.func1() submitted by /u/iblis3 [link] [comments]  ( 1 min )
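    One preprocessing tweak that may help here, sketched below: discard the structured JSON log lines entirely and keep only the free-text tail (panics, tracebacks, coverage lines), masking numbers so that similar failures collapse into one phrase for the embeddings.

    ```python
    import json
    import re

    def extract_error_tail(log: str) -> str:
        kept = []
        for line in log.splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)  # parses -> structured log line, discard it
                continue
            except ValueError:
                pass
            kept.append(re.sub(r"\d+(\.\d+)?", "<num>", line))  # mask numbers
        return " ".join(kept)

    # For the example log this keeps, among others,
    # "panic: test timed out after <num>m<num>s" - the phrase worth clustering.
    ```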
    [D] Do you use any tools to automate the data cleaning process?
    Data cleaning is very tedious. Are there solutions that actually solve the data cleaning problem or is it that we have to go through it manually because that helps to learn about the data? submitted by /u/Speech_titan [link] [comments]  ( 1 min )
    [2205.08957] Meta-Learning Sparse Compression Networks
    submitted by /u/Forsaken_Scientist [link] [comments]
    [D] Thinking about buying a Jetson Nano and Coral Stick for ML
    Currently I use Google Colab and Kaggle kernels several hours a day to train models. But given the limitations of the free tiers, I ask myself: is it worth buying a Jetson Nano 4GB and using it together with a Coral Stick as an alternative, or is buying Colab Pro the better option? Most of the time I use TensorFlow, mostly for image/video classification and object detection. Is it worth buying, or has anyone had similar suggestions or experiences? submitted by /u/Stevy1981 [link] [comments]  ( 1 min )
    [D] Google Colab equivalent alternative to R / Rstudio
    I want to use a GPU for running some R code. I know that Google Colab is a good option when using Python and allows access to a GPU - is there an equivalent alternative for RStudio? What about RStudio Server: what are the advantages of using RStudio Server when it uses the computer's CPU anyway? submitted by /u/triary95 [link] [comments]  ( 1 min )
    [P] Reviewing Marketing Mix Model - concerned about a few issues...
    Hi, everyone. I am reviewing a marketing mix model done by someone before I started the job. They have a model which has been claimed to give an accurate attribution across different media channels and other variables. It gives a pretty good result (R2 = 0.95), I admit. But I was concerned by the way the long-term effect of advertising has been worked into the model. The current model uses the sum of brand investment with a 95% carryover effect to account for the 'long term effect', while also using the same brand investment figure to account for the 'short term effect'. I pointed out that this would raise multicollinearity issues, but was told that this isn't a problem as the "modeling was done with an RF". Is there a way to better explain my concern? submitted by /u/SuperSodori [link] [comments]  ( 1 min )
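    One way to make the concern concrete for colleagues is to show that both regressors are deterministic functions of the same input series, so the model has no information with which to split attribution between the 'short term' and 'long term' terms; a sketch with toy data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    spend = rng.gamma(2.0, 10.0, size=104)  # two years of weekly brand investment

    adstock = np.zeros_like(spend)          # 95% carryover ("long term" regressor)
    for t in range(len(spend)):
        adstock[t] = spend[t] + (0.95 * adstock[t - 1] if t > 0 else 0.0)

    # `spend` and `adstock` enter the model side by side, yet adstock is a fixed
    # linear filter of spend - any split of credit between them is arbitrary.
    print(np.corrcoef(spend, adstock)[0, 1])
    ```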
    [D] Current use of word2vec in 2022
    We can all agree that word2vec was revolutionary in the field of NLP. My question is what is the current state of this technique? Is it mainly used for educational purposes to teach students about the building blocks of word embeddings? - Is it used in other fields such as studying the evolution of languages through historical texts? - Is it used in specific tasks such as only syntactic similarities, or only semantic similarities? - ... submitted by /u/LanverYT [link] [comments]  ( 2 min )
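    It is still in routine use wherever a cheap, static embedding is good enough, and training one remains a few lines in gensim (API as of gensim 4.x); a sketch on a toy corpus:

    ```python
    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["the", "lazy", "dog", "sleeps"]]  # toy corpus; use real text
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
    print(model.wv.most_similar("fox", topn=3))
    ```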
    [D] Is Marketing Mix Modelling full of crap?
    I have spent the last year working as an analyst for a large media company and have been primarily occupied with MMM projects. I need to clarify that I had never previously worked in/studied economics, and I come from a natural sciences background with a focus on statistics. Since I have been in this company, I have been struggling so much, for several different reasons. One of these reasons is that I feel like the work we're doing is complete and utter bullshit! I am trying to figure out if the problem is me (which I am sure that some of it is) or if this is a common phenomenon. Every time I start modelling I am super enthusiastic about it and determined that this time it'll all make sense, and every time I end up exasperated and ready to give up, quit my job and never come back. I feel like the models we are building are so full of crap, and simply aimed at justifying our client's expectations in order to make them happy - we always end up using mad coefficients and breakdowns of variables just so that they are what they should be. I consider myself to be very analytical and with good problem-solving skills - I have been praised by my managers and given really good feedback for exceeding expectations and all that jazz. HOWEVER, I constantly feel like an utter failure, because I spend so much time trying to make sense of these models, to the point where I exhaust myself, give up, and then just do what I feel everyone else does - manipulate the metrics so that it works our way, and this kills my soul every single time. Is this something that is widely happening in the industry, and am I being too idealistic/perfectionist? Or am I seriously lacking some training (which I wouldn't say I got much of), and what can I do to improve? P.S. I am at the point where I have almost given up and I am close to leaving my job. I have run my mental health down so much, to the point of burnout, so any advice would be very much appreciated! submitted by /u/necplorer [link] [comments]  ( 6 min )
    [P] Training to read PDF documents. Any ideas?
    Currently I use regex to identify patterns, coupled with an OCR library (Tesseract), to import PDF data into dataframes. It's a bit of a hit-and-miss process: every new exception has to be built into the regex. I was wondering if machine learning could take up the job? One way to go about it might be marking the positions of data from the OCR output to be read into their respective columns, but there might be a better way. Has anybody done this? submitted by /u/card_chase [link] [comments]  ( 4 min )
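    On the ML side, layout-aware token-classification models such as LayoutLM (a fine-tuning tutorial appears elsewhere in this digest) are the usual next step. As a stopgap, pytesseract's structured output already gives word positions that rules or a classifier can learn over, instead of hand-maintained regex; a sketch (the input filename is an assumption):

    ```python
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path("invoice.pdf", dpi=300)  # rasterize the PDF
    data = pytesseract.image_to_data(
        pages[0], output_type=pytesseract.Output.DATAFRAME)
    # Each row carries text plus left/top/width/height - positional features
    # a model can use to decide which column a token belongs to.
    words = data[data.conf > 60][["text", "left", "top", "width", "height"]]
    ```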
    [R] AI Research Rankings 2022: Sputnik Moment for China?
    Hey Reddit! A heated debate is going on today on the state of the strategic race between the United States and China to dominate in AI. I decided to gather some facts and analyzed publications at ICML 2021 and NeurIPS 2021. Here are the findings -- would love to hear what you think! 🤝❤️🤖 https://thundermark.medium.com/ai-research-rankings-2022-sputnik-moment-for-china-64b693386a4 submitted by /u/chersonesus [link] [comments]  ( 2 min )
  • Open

    Sim-2-real problem regarding system delay
    If the goal is to train an agent for a robot control policy, the actions stand for current values which control the robot joints. In the real system, however, there are system delays and communication delays, so applying the actions to the robot does not directly result in motions, which is, however, the case in simulation (for instance Isaac Gym, which I am using). As I have measured, the real system takes 250~300 ms to react to a given system input and rotate its joints. Therefore, the control policy trained in the simulator, where the system delay is almost 0~15 ms, is not usable anymore. What would be the approaches to overcome this sim-2-real problem in this case without identifying a model of the system? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
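    A common model-free approach is to train with the delay rather than around it: buffer actions in simulation so each one takes effect only after the measured latency (often also appending the pending actions to the observation). A generic Gym-style sketch; Isaac Gym's vectorized tensor API would differ:

    ```python
    from collections import deque

    import gym

    class DelayedActionEnv(gym.Wrapper):
        def __init__(self, env, delay_steps=5):  # e.g. ~275 ms at ~18 Hz control
            super().__init__(env)
            self.delay_steps = delay_steps

        def reset(self, **kwargs):
            no_op = self.env.action_space.sample() * 0  # assumes 0 is a no-op
            self.pending = deque([no_op] * self.delay_steps)
            return self.env.reset(**kwargs)

        def step(self, action):
            self.pending.append(action)                   # queue the new command
            return self.env.step(self.pending.popleft())  # apply a stale one
    ```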
    Hierarchical Reinforcement Learning (HRL)
    Hello, I intend to apply HRL, more specifically the option-critic architecture, where I have three options. However, I don't see much documentation on the internet. Do you have any implementation code I could take inspiration from (to see how it works)? submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments
    submitted by /u/jkterry1 [link] [comments]
    Ideas to start writing papers
    Hey there! I'm eager to start writing a paper. I have a background in AI, but I can't think of ideas for a topic to write about. How do you guys get ideas for papers? Is it a good idea to start by writing surveys? submitted by /u/ZazieIsInTheHouse [link] [comments]  ( 1 min )
    Channel policy on RL-related job opportunities
    Hi all, I am aware of two job opportunities at my company in the area of experimentation and RL (mainly MAB- and cMAB-style research and application). I was curious what this channel's policy is on making people aware of these job openings, or if there is a better venue for that (Discord channels, etc.). Let me know! Thanks! submitted by /u/Helga-Helga [link] [comments]  ( 1 min )
    Let's build an Autonomous Taxi 🚖 using Q-Learning (Deep Reinforcement Learning Free Class by Hugging Face 🤗)
    Hey there! I’m happy to announce that we just published the second unit of the Deep Reinforcement Learning Class 🥳 In this unit, we're going to dive deeper into one of the reinforcement learning methods, value-based methods, and study our first RL algorithm: Q-Learning. We'll also implement our first RL agent from scratch, a Q-Learning agent, train it in two environments, and share it with the community: Frozen-Lake-v1 ⛄ (non-slippery version), where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H); and an autonomous taxi 🚕, which will need to learn to navigate a city to transport its passengers from point A to point B. You’ll be able to compare the results of your Q-Learning agent using the leaderboard 🏆 1️⃣ The introduction to Q-Learning part 1 article 👉 https://huggingface.co/blog/deep-rl-q-part1 2️⃣ The introduction to Q-Learning part 2 article 👉 https://huggingface.co/blog/deep-rl-q-part2 3️⃣ The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb 4️⃣ The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard If you have questions and feedback, I would love to answer. submitted by /u/cranthir_ [link] [comments]  ( 1 min )
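    For anyone skimming before committing to the unit: the entire update that tabular agents like these are built around fits in a few lines.

    ```python
    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Q-Learning's TD target bootstraps with the *greedy* next action.
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])

    Q = np.zeros((500, 6))  # e.g. Taxi-v3: 500 discrete states, 6 actions
    ```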
    Any recommended paper on using Deep RL on Drones
    Hi, I am a beginner in using Deep RL on robots (especially drones). I have tried RL on OpenAI Gym environments like CartPole, Breakout and BipedalWalker, but I have never applied it to an actual robot. Now I want to apply DDPG to a drone (in simulation), but I have failed miserably several times. Is there any paper that explains it in detail? Any other resources are also welcome. submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    What is offline reinforcement learning?
    Offline Reinforcement Learning (RL), also known as Batch Reinforcement Learning, is a variant of RL that effectively leverages large, previously collected datasets for large-scale real-world applications. The use of static datasets means that during the training process of the agent, offline RL does not perform any form of online interaction and exploration, which is also the most significant difference from online reinforcement learning methods. For convenience, we refer to non-offline reinforcement learning, including both on-policy and off-policy RL, as online reinforcement learning (Online RL) in the following sections. [figure: Illustration of three classic online reinforcement learning modes] In the figure, (a) stands for On-policy RL, where the agent uses the current policy πk to interact…  ( 8 min )
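    The defining property is easy to see in code: the learner iterates over a fixed buffer of transitions and never calls the environment. A minimal tabular sketch in the fitted-Q spirit (not any specific offline RL algorithm):

    ```python
    import numpy as np

    def offline_q_learning(dataset, n_states, n_actions,
                           epochs=50, alpha=0.1, gamma=0.99):
        Q = np.zeros((n_states, n_actions))
        for _ in range(epochs):
            for s, a, r, s_next, done in dataset:  # static buffer, no env.step()
                target = r + (0.0 if done else gamma * np.max(Q[s_next]))
                Q[s, a] += alpha * (target - Q[s, a])
        return Q
    ```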
    Looking for project mates
    I'm trying to reimplement the paper High-Throughput Synchronous Deep RL using Ray. Is anyone interested in working on this together? submitted by /u/SirRantcelot [link] [comments]  ( 1 min )
  • Open

    Excuse the language, but why the fuck is there nothing accessible to the public right now that can generate genuinely comprehensible images like DALL-E 2?
    Forgive me for sounding irritated, but is there absolutely nothing I can use right now, be it a website, program, mobile app, whatever, that I can just pump some cool stuff out of? Why are we all being shown this amazing technology only to be told "Yo this sick tech exists, but you ain't fuckin using it lol", except for a few select people (e.g. MKBHD's access to DALL-E 2 for a day)? I'm not talking about excuses such as NightCafe or Wombo Dream; I've used them to death and had nothing but quite frankly terrible results, and I want something where I can pop in terms such as, I don't know, "drift car Nissan Silvia" and get a picture of an actual car for album art etc., or something along those lines, if not just to admire how crazy AI is becoming. Look, if this stuff existed but was kept completely secret, then the whole 'what you don't know can't hurt you' idea would apply and I would not be pissed off. What is frustrating is that all this wizardry is being flaunted and dangled in our faces whilst also being kept out of reach. So my initial question still applies: is there anything remotely available right now that can generate images somewhat comparable? Thanks! submitted by /u/Michael_Goodwin [link] [comments]  ( 3 min )
    TPI - an open-source tool, the first to train machine learning models on any cloud using Terraform
    Terraform Provider Iterative (TPI) is the first tool to simplify ML training on any cloud using Terraform: it helps infrastructure and ML team members save significant time and money in maintaining and configuring training infrastructure. Terraform provides a flexible CLI for managing hundreds of cloud services, and TPI enables data scientists to provision training resources through it without having to learn new software. submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Rephrasing an English paragraph using AI - Writing is all about culture
    submitted by /u/data-gig [link] [comments]
    OpenAI's DALL-E 2 is pretty compliant - but who is responsible anyway?
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    A Roadmap overview To A Career In AI
    According to the World Economic Forum, over 97 million new jobs may be created by 2025, all due to the application of AI in different fields. AI is one of the century’s most important and disruptive technological achievements. Here is a roadmap overview for a career in AI. submitted by /u/artiba-AI [link] [comments]
    This article showcases a tutorial on fine-tuning layoutLM v2 for invoice recognition starting from annotation to training and inference.
    submitted by /u/UBIAI [link] [comments]
    Automated CV Pipelines | Box to Polygon
    The 5th episode of the webinar series on Automated CV Pipelines is coming up! It will be covering automatic instance segmentation and methods to streamline the annotation process. If you're interested, you can register here! submitted by /u/WeekendClassic [link] [comments]
    Google AI Plans for 2030
    When someone searches a query on Google, it leaves a carbon dioxide footprint in the atmosphere. Google AI plans for 2030 have been revealed, discussing what goals Google has to lower its carbon footprint and allowing other partners to play their part in guaranteeing environmental sustainability. Read more submitted by /u/ridamughal110 [link] [comments]
    Top AI Investment Opportunities In 2021
    There are multiple reasons companies are finding investment opportunities in AI that are going to be beneficial for them in the year 2021. Investment in AI will help a wide range of organizations go through the economic crisis as they emerge from the pandemic. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    How Air Traffic Can Be Optimized Using Artificial Intelligence
    With the fast-growing and high-density global air traffic, ensuring efficiency and air transportation safety becomes a critical challenge. AI is already revolutionizing the way air traffic management systems are manufactured and hence is believed to play a key role in optimizing air traffic flow. Read more submitted by /u/ridamughal110 [link] [comments]
    Is the advancement of AI in accordance with the best long-term interests of Humanity?
    AI advancements in digital technology are growing, and today we have far more technical capability than we had in the 90s, with the potential to expand even quicker in the future. Is, however, the continuance of AI growth in the best interests of humanity? Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Is artificial intelligence the most powerful thing that humans have ever created?
    Artificial Intelligence is one of the most powerful things humans have been working on for decades, and its limitless magical spells are altering our lives. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Build ML with zero data, training or setup with Humingbird!
    submitted by /u/holamyeung [link] [comments]  ( 2 min )
    Where can I best get OPT 175B to run?
    I know I sound like a douche. I got access to the OPT 175B model for my research, but my university's GPU capabilities aren't sufficient. Usually I train my LLMs on two local 50GB GPUs; that doesn't seem to work now - so, what would you recommend? submitted by /u/Trick_Brain [link] [comments]  ( 1 min )
    Reinforcement learning companies
    Are there any companies (startups are fine too) doing work in Reinforcement Learning - more specifically in game NPCs/bots that you guys know? submitted by /u/Nice_Working [link] [comments]
    Please help me out in my research.
    I used the glob function, but instead of getting 33 categories I only get one, so the output layer of the model has size 1. How could I fix it? Thank you guys in advance; all help will be appreciated. https://colab.research.google.com/drive/1NdpWmOVbqV2EUJ50vLj7nhWdWWHtbY_8?usp=sharing submitted by /u/Vector-Desperandum24 [link] [comments]  ( 1 min )
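    Without seeing the notebook it is hard to be sure, but the single most common cause of "one category instead of 33" is globbing the parent folder itself rather than its subdirectories; a sketch with an assumed layout of one folder per class:

    ```python
    from glob import glob

    classes = glob("dataset/")    # matches only the folder itself -> 1 result
    classes = glob("dataset/*/")  # matches each class subfolder   -> 33 results
    num_classes = len(classes)    # size the model's output layer with this
    ```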
  • Open

    5 Ways to Optimize Database Performance
    Database performance tuning allows developers and database administrators to make better use of system resources for lasting performance improvements. Databases are like the central nervous system of an application: they are responsible for the organization and function of critical processes, and even minor database-related performance issues can impact the entire operation. Locating issues in the databases… Read More »5 Ways to Optimize Database Performance The post 5 Ways to Optimize Database Performance appeared first on Data Science Central.  ( 6 min )
  • Open

    AI Predicts Volcano Eruptions + Rapidly Detect Earthquakes | AI Robot Autonomously Reduces Jellyfish Overpopulation | AI Detects Pancreatic Cancer
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    Advice
    Hello, I need help with finding weights on a couple of outputs, but I don't quite understand how it fully works. Could anyone help me? It's for educational purposes; I just want advice and help, not to have the work done for me. submitted by /u/106steve [link] [comments]
  • Open

    What is Extended Reality?
    Advances in extended reality have already changed the way we work, live and play, and it’s just getting started. Extended reality, or XR, is an umbrella category that covers a spectrum of newer, immersive technologies, including virtual reality, augmented reality and mixed reality. From gaming to virtual production to product design, XR has enabled people Read article > The post What is Extended Reality? appeared first on NVIDIA Blog.  ( 5 min )
    From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX
    Building next-generation intelligent vehicles requires an AI infrastructure that pushes the cutting edge. Electric vehicle maker NIO is using NVIDIA HGX to build a comprehensive data center infrastructure for developing AI-powered, software-defined vehicles. With high-performance compute, the automaker can continuously iterate on sophisticated deep learning models, creating robust autonomous driving algorithms in a closed-loop environment. Read article > The post From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    ML for Algorithmic Trading, with Stefan Jansen
    Listen to this episode on Anchor FM. In this episode of the DATAcated podcast, host Kate Strachnyi talks with Stefan Jansen about machine… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    Data Science Project Vs Machine Learning Project: Python Data Science Complete Course — Part2
    Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
  • Open

    The Berkeley Crossword Solver
    We recently built the Berkeley Crossword Solver (BCS), the first computer program to beat every human competitor in the world’s top crossword tournament. The BCS combines neural question answering and probabilistic inference to achieve near-perfect performance on most American-style crossword puzzles, like the one shown below: Crosswords are challenging for humans and computers alike. Many clues are vague or underspecified and can’t be answered until crossing constraints are taken into account. While some clues are similar to factoid question answering, others require relational reasoning or understanding difficult wordplay. Here are a handful of example clues from our dataset (answers at the bottom of this post): They’re given out at Berkeley’s HAAS School (4) Winter hrs. in Berkeley (3) …  ( 3 min )
  • Open

    Artificial intelligence predicts patients’ race from their medical images
    Study shows AI can identify self-reported race from medical images that contain no indications of race detectable by human experts.  ( 7 min )
  • Open

    Design choice and machine learning model performances. (arXiv:2201.10239v2 [stat.ML] UPDATED)
    An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and of the model for data analysis is often not driven by statistical or algorithmic advantages; thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This article discusses the choice of design in relation to ML model performance. A study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    Class-Aware Generative Adversarial Transformers for Medical Image Segmentation. (arXiv:2201.10737v3 [cs.CV] UPDATED)
    Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of generative adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.
    Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v4 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. Our first result is a new universal lower bound on the number of single-node interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of single-node interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better. To prove our lower bound, we develop the notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. We also use the techniques developed here to extend our results to the setting of multi-node interventions.
    Unsupervised Learning of Rydberg Atom Array Phase Diagram with Siamese Neural Networks. (arXiv:2205.04051v2 [physics.comp-ph] UPDATED)
    We introduce an unsupervised machine learning method based on Siamese Neural Networks (SNN) to detect phase boundaries. This method is applied to Monte-Carlo simulations of Ising-type systems and Rydberg atom arrays. In both cases the SNN reveals phase boundaries consistent with prior research. The combination of leveraging the power of feed-forward neural networks, unsupervised learning and the ability to learn about multiple phases without knowing about their existence provides a powerful method to explore new and unknown phases of matter.
    SepTr: Separable Transformer for Audio Spectrogram Processing. (arXiv:2203.09581v2 [cs.CV] UPDATED)
    Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v2 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence at an exponential rate to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.
    Personalized Interventions for Online Moderation. (arXiv:2205.09462v1 [cs.SI])
    Current online moderation follows a one-size-fits-all approach, where each intervention is applied in the same way to all users. This naive approach is challenged by established socio-behavioral theories and by recent empirical results that showed the limited effectiveness of such interventions. We propose a paradigm-shift in online moderation by moving towards a personalized and user-centered approach. Our multidisciplinary vision combines state-of-the-art theories and practices in diverse fields such as computer science, sociology and psychology, to design personalized moderation interventions (PMIs). In outlining the path leading to the next-generation of moderation interventions, we also discuss the most prominent challenges introduced by such a disruptive change.
    Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns. (arXiv:2205.09390v1 [stat.ML])
    Rapid advances in sensors, wireless communication, cloud computing and data science have brought unprecedented amounts of data to assist transportation engineers and researchers in making better decisions. However, traffic data in reality often has corrupted or incomplete values due to detector and communication malfunctions. Data imputation is thus required to ensure the effectiveness of downstream data-driven applications. To this end, numerous tensor-based methods treating the imputation problem as the low-rank tensor completion (LRTC) have been attempted in previous works. To tackle rank minimization, which is at the core of the LRTC, most of the aforementioned methods utilize the tensor nuclear norm (NN) as a convex surrogate for the minimization. However, the over-relaxation issue in NN prevents it from achieving desirable performance in practice. In this paper, we define an innovative nonconvex truncated Schatten p-norm for tensors (TSpN) to approximate tensor rank and impute missing spatiotemporal traffic data under the LRTC framework. We model traffic data into a third-order tensor structure of (time intervals, locations (sensors), days) and introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the tensor mode-n fibers. Despite the nonconvexity of the objective function in our model, we derive the global optimal solutions by integrating the alternating direction method of multipliers (ADMM) with generalized soft-thresholding (GST). In addition, we design a truncation rate decay strategy to deal with varying missing rate scenarios. Comprehensive experiments are finally conducted using real-world spatiotemporal datasets, which demonstrate that the proposed LRTC-TSpN method performs well under various missing cases, meanwhile outperforming other SOTA tensor-based imputation models in almost all scenarios.
    Evolutionary latent space search for driving human portrait generation. (arXiv:2204.11887v2 [cs.CV] UPDATED)
    This article presents an evolutionary approach for synthetic human portraits generation based on the latent space exploration of a generative adversarial network. The idea is to produce different human face images very similar to a given target portrait. The approach applies StyleGAN2 for portrait generation and FaceNet for face similarity evaluation. The evolutionary search is based on exploring the real-coded latent space of StyleGAN2. The main results over both synthetic and real images indicate that the proposed approach generates accurate and diverse solutions, which represent realistic human portraits. The proposed research can contribute to improving the security of face recognition systems.
    Approximating Persistent Homology for Large Datasets. (arXiv:2204.09155v2 [stat.ML] UPDATED)
    Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying smaller multiple subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
    Causal Inference from Small High-dimensional Datasets. (arXiv:2205.09281v1 [cs.LG])
    Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.
    Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation. (arXiv:2205.04183v2 [cs.CV] UPDATED)
    We propose a simple but effective source-free domain adaptation (SFDA) method. Treating SFDA as an unsupervised clustering problem and following the intuition that local neighbors in feature space should have more similar predictions than other features, we propose to optimize an objective of prediction consistency. This objective encourages local neighborhood features in feature space to have similar predictions while features farther away in feature space have dissimilar predictions, leading to efficient feature clustering and cluster assignment simultaneously. For efficient training, we seek to optimize an upper-bound of the objective resulting in two simple terms. Furthermore, we relate popular existing methods in domain adaptation, source-free domain adaptation and contrastive learning via the perspective of discriminability and diversity. The experimental results prove the superiority of our method, and our method can be adopted as a simple but strong baseline for future research in SFDA. Our method can be also adapted to source-free open-set and partial-set DA which further shows the generalization ability of our method.
    High-resolution landscape-scale biomass mapping using a spatiotemporal patchwork of LiDAR coverages. (arXiv:2205.08530v1 [stat.AP] CROSS LISTED)
    Estimating forest aboveground biomass at fine spatial scales has become increasingly important for greenhouse gas estimation, monitoring, and verification efforts to mitigate climate change. Airborne LiDAR continues to be a valuable source of remote sensing data for estimating aboveground biomass. However, airborne LiDAR collections may take place at local or regional scales covering irregular, non-contiguous footprints, resulting in a 'patchwork' of different landscape segments at different points in time. Here we addressed common obstacles, including the selection of training data, the investigation of regional or coverage-specific patterns in bias and error, map agreement, and model-based precision assessments at multiple scales. Three machine learning algorithms and an ensemble model were trained using field inventory data (FIA), airborne LiDAR, and topographic, climatic and cadastral geodata. Using strict selection criteria, 801 FIA plots were selected with co-located point clouds drawn from a patchwork of 17 leaf-off LiDAR coverages (2014-2019). Our ensemble model created 30m AGB prediction surfaces within a predictor-defined area of applicability (98% of LiDAR coverage), and the resulting AGB predictions were compared with FIA plot-level and areal estimates at multiple scales of aggregation. Our model was overall accurate (% RMSE 13-33%), had very low bias (MBE $\leq$ $\pm$5 Mg ha$^{-1}$), explained most field-observed variation (R$^2$ 0.74-0.93), and produced estimates that were both largely consistent with FIA's aggregate summaries (86% of estimates within 95% CI) and precise when aggregated to arbitrary small areas (mean bootstrap standard error 0.37 Mg ha$^{-1}$). We share practical solutions to challenges faced when using spatiotemporal patchworks of LiDAR to meet growing needs for biomass prediction and mapping, and applications in carbon accounting and ecosystem stewardship.
    Multi-DNN Accelerators for Next-Generation AI Systems. (arXiv:2205.09376v1 [cs.AR])
    As the use of AI-powered applications widens across multiple domains, so do the computational demands. The primary driver of AI technology is deep neural networks (DNNs). Whether focusing on cloud-based systems that serve multiple AI queries from different users, each with their own DNN model, or on mobile robots and smartphones employing pipelines of various models or parallel DNNs for the concurrent processing of multi-modal data, the next generation of AI systems will have multi-DNN workloads at their core. Large-scale deployment of AI services and integration across mobile and embedded systems require additional breakthroughs on the computer architecture front, with processors that can maintain high performance as the number of DNNs increases while meeting quality-of-service requirements, giving rise to the topic of multi-DNN accelerator design.
    Simplifying Node Classification on Heterophilous Graphs with Compatible Label Propagation. (arXiv:2205.09389v1 [cs.LG])
    Graph Neural Networks (GNNs) have been predominant for graph learning tasks; however, recent studies showed that a well-known graph algorithm, Label Propagation (LP), combined with a shallow neural network can achieve comparable performance to GNNs in semi-supervised node classification on graphs with high homophily. In this paper, we show that this approach falls short on graphs with low homophily, where nodes often connect to nodes of the opposite classes. To overcome this, we carefully design a combination of a base predictor with the LP algorithm that enjoys a closed-form solution as well as convergence guarantees. Our algorithm first learns the class compatibility matrix and then aggregates label predictions using the LP algorithm, weighted by class compatibilities. On a wide variety of benchmarks, we show that our approach achieves leading performance on graphs with various levels of homophily. Meanwhile, it has orders of magnitude fewer parameters and requires less execution time. Empirical evaluations demonstrate that simple adaptations of LP can be competitive in semi-supervised node classification in both homophily and heterophily regimes.
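    A minimal sketch of compatibility-weighted label propagation (illustrative; the learned compatibility matrix H and the paper's closed-form variant are assumed given here):
```python
import numpy as np

def compatible_lp(A, Y0, H, alpha=0.9, iters=50):
    """A: adjacency (n x n), Y0: base-predictor probabilities (n x c),
    H: class compatibility matrix (c x c). Neighbors vote through H, so
    heterophilous edges can transfer evidence across classes."""
    d = A.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ A @ D_inv_sqrt             # symmetric normalization
    Y = Y0.copy()
    for _ in range(iters):
        Y = (1 - alpha) * Y0 + alpha * (S @ Y @ H)
    return Y / Y.sum(1, keepdims=True)
```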
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v2 [stat.ML] UPDATED)
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret them as random effects in a linear mixed model. Using two public benchmark datasets, we show that the resulting model improves forecasting performance.
    Towards Applicable Reinforcement Learning: Improving the Generalization and Sample Efficiency with Policy Ensemble. (arXiv:2205.09284v1 [cs.LG])
    It is challenging for reinforcement learning (RL) algorithms to succeed in real-world applications such as financial trading and logistics systems, due to noisy observations and environment shift between training and evaluation. Resolving real-world tasks thus requires both high sample efficiency and good generalization. However, directly applying typical RL algorithms can lead to poor performance in such scenarios. Considering the great performance of ensemble methods on both accuracy and generalization in supervised learning (SL), we design a robust and applicable method named Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner. Notably, EPPO combines each policy and the policy ensemble organically and optimizes both simultaneously. In addition, EPPO adopts a diversity enhancement regularization over the policy space, which helps it generalize to unseen states and promotes exploration. We theoretically prove that EPPO increases exploration efficacy, and through comprehensive experimental evaluations on various tasks, we demonstrate that EPPO achieves higher efficiency and is robust for real-world applications compared with vanilla policy optimization algorithms and other ensemble methods. Code and supplemental materials are available at https://seqml.github.io/eppo.
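    As an illustration of diversity enhancement over an ensemble of policy heads, here is a hedged sketch (EPPO's exact regularizer and joint objective are in the paper; the pairwise-KL bonus below is an assumption):
```python
import torch
import torch.nn.functional as F

def diversity_bonus(logits_list):
    """Encourage policy heads to disagree: mean pairwise KL between head policies."""
    kls = []
    for i, li in enumerate(logits_list):
        for j, lj in enumerate(logits_list):
            if i != j:
                kls.append(F.kl_div(F.log_softmax(lj, -1),
                                    F.softmax(li, -1), reduction="batchmean"))
    return torch.stack(kls).mean()

def ensemble_policy(logits_list):
    """The deployed policy averages the heads' action distributions."""
    return torch.stack([F.softmax(l, -1) for l in logits_list]).mean(0)
```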
    FlowPool: Pooling Graph Representations with Wasserstein Gradient Flows. (arXiv:2112.09990v2 [cs.LG] UPDATED)
    In several machine learning tasks for graph-structured data, the graphs under consideration may be composed of a varying number of nodes. Therefore, it is necessary to design pooling methods that aggregate graph representations of varying size to representations of fixed size, which can be used in downstream tasks such as graph classification. Existing graph pooling methods offer no guarantee with regard to the similarity of a graph representation and its pooled version. In this work, we address this limitation by proposing FlowPool, a pooling method that optimally preserves the statistics of a graph representation in its pooled counterpart by minimising their Wasserstein distance. This is achieved by performing a Wasserstein gradient flow with respect to the pooled graph representation. Our method relies on a versatile implementation which can take into account the geometry of the representation space through any ground cost, and which computes the gradient of the Wasserstein distance with automatic differentiation. We propose the differentiation of the Wasserstein flow layer using an implicit differentiation scheme. Therefore, our pooling method is amenable to automatic differentiation and can be integrated into end-to-end deep learning architectures. Further, FlowPool is invariant to permutations and can therefore be combined with permutation-equivariant feature extraction layers in GNNs in order to obtain predictions that are independent of the ordering of the nodes. Experimental results demonstrate that our method leads to an increase in performance compared to existing pooling methods when evaluated on graph classification.
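    A hedged sketch of the gradient-flow principle using the geomloss package's Sinkhorn approximation (an illustration only; the authors' implementation, ground costs, and implicit differentiation scheme differ):
```python
import torch
from geomloss import SamplesLoss  # Sinkhorn-based approximation of Wasserstein

def flow_pool(X, m=8, steps=200, lr=0.1):
    """Pool an (n x d) node-feature matrix to a fixed (m x d) representation by
    descending an entropic Wasserstein distance between pooled and input points."""
    Y = X[torch.randperm(X.size(0))[:m]].clone().requires_grad_(True)
    wass = SamplesLoss("sinkhorn", p=2, blur=0.05)
    opt = torch.optim.SGD([Y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        wass(Y, X).backward()   # gradient flow step on the pooled representation
        opt.step()
    return Y.detach()
```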
    Time Series Generation with Masked Autoencoder. (arXiv:2201.07006v3 [cs.LG] UPDATED)
    This paper shows that a masked autoencoder with extrapolator (ExtraMAE) is a scalable self-supervised model for time series generation. ExtraMAE randomly masks some patches of the original time series and learns temporal dynamics by recovering the masked patches. Our approach has two core designs. First, ExtraMAE is self-supervised; the reconstruction supervision allows it to capture the temporal dynamics of the original time series effectively and efficiently. Second, ExtraMAE proposes an extrapolator to disentangle the decoder's two jobs: recovering latent representations and mapping them back into the feature space. These unique designs enable ExtraMAE to consistently and significantly outperform state-of-the-art (SoTA) benchmarks in time series generation. The lightweight architecture also makes ExtraMAE fast and scalable. ExtraMAE shows outstanding behavior in various downstream tasks such as time series classification, prediction, and imputation. As a self-supervised generative model, ExtraMAE allows explicit management of the synthetic data. We hope this paper will usher in a new era of time series generation with self-supervised models.
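    A minimal sketch of masked-patch reconstruction for time series (the generic MAE recipe; ExtraMAE's extrapolator and exact architecture are not reproduced here):
```python
import torch
import torch.nn as nn

class TinyTimeSeriesMAE(nn.Module):
    def __init__(self, patch_len=8, d_model=64):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.decode = nn.Linear(d_model, patch_len)

    def forward(self, x, mask_ratio=0.5):
        B, L = x.shape                                  # L divisible by patch_len
        patches = x.view(B, L // self.patch_len, self.patch_len)
        mask = torch.rand(B, patches.size(1), device=x.device) < mask_ratio
        visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches
        recon = self.decode(self.encoder(self.embed(visible)))
        # Reconstruction loss only on the masked patches, MAE-style.
        return ((recon - patches) ** 2)[mask].mean()
```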
    Speeding Up Entmax. (arXiv:2111.06832v3 [cs.CL] UPDATED)
    Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution, it gives each token in the vocabulary a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $\alpha$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $\alpha$-entmax which keeps its virtuous characteristics but is as fast as optimized softmax, and achieves on-par or better performance on the machine translation task.
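    For reference, a hedged sketch of exact 1.5-entmax for a single score vector, following the sort-based solution described by Peters et al.; the faster routine proposed above is not reproduced here.
```python
import numpy as np

def entmax15(z):
    """1.5-entmax: p_i = max(0, z_i/2 - tau)^2 with tau chosen so p sums to 1."""
    s = np.sort(z / 2.0)[::-1]                    # sorted descending
    k = np.arange(1, len(s) + 1)
    mean = np.cumsum(s) / k
    meansq = np.cumsum(s ** 2) / k
    # Solve k*tau^2 - 2*S1*tau + (S2 - 1) = 0 for each candidate support size.
    delta = np.maximum((1.0 - k * (meansq - mean ** 2)) / k, 0.0)
    tau = mean - np.sqrt(delta)
    support = np.sum(s >= tau)                    # largest k with s_k in the support
    p = np.maximum(z / 2.0 - tau[support - 1], 0.0) ** 2
    return p / p.sum()                            # guard against numerical drift
```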
    Theory of Acceleration of Decision Making by Correlated Time Sequences. (arXiv:2203.16004v3 [cs.LG] UPDATED)
    Photonic accelerators have been intensively studied to provide enhanced information processing capability and to benefit from the unique attributes of physical processes. Recently, it has been reported that chaotically oscillating ultrafast time series from a laser, called laser chaos, provide the ability to solve multi-armed bandit (MAB) problems, or decision-making problems, at GHz order. Furthermore, it has been confirmed that the negatively correlated time-domain structure of laser chaos contributes to the acceleration of decision-making. However, the underlying mechanism by which correlated time series accelerate decision-making has been unknown. In this paper, we present a theoretical model to account for the acceleration of decision-making by correlated time sequences. We first confirm the effectiveness of the negative autocorrelation inherent in time series for solving two-armed bandit problems using Fourier transform surrogate methods. We then propose a theoretical model that treats, in a unified manner, the correlated time series fed into the decision-making system and the internal state of that system, inspired by correlated random walks. We demonstrate that the performance derived analytically by the theory agrees well with numerical simulations, which confirms the validity of the proposed model and leads to optimal system design. The present study paves a new way toward exploiting correlated time series for decision-making, impacting artificial intelligence and other applications.
    Overcoming challenges in leveraging GANs for few-shot data augmentation. (arXiv:2203.16662v2 [stat.ML] UPDATED)
    In this paper, we explore the use of GAN-based few-shot data augmentation as a method to improve few-shot classification performance. We explore how a GAN can be fine-tuned for such a task (including in a class-incremental manner), and perform a rigorous empirical investigation into how well these models can improve few-shot classification. We identify issues related to the difficulty of training such generative models under a purely supervised regime with very few examples, as well as issues regarding the evaluation protocols of existing works. We also find that in this regime, classification accuracy is highly sensitive to how the classes of the dataset are randomly split. Therefore, we propose a semi-supervised fine-tuning approach as a more pragmatic way forward to address these problems.
    Deep Dynamic Effective Connectivity Estimation from Multivariate Time Series. (arXiv:2202.02393v3 [cs.LG] UPDATED)
    Recently, methods that represent data as a graph, such as graph neural networks (GNNs), have been successfully used to learn data representations and structures to solve classification and link prediction problems. The applications of such methods are vast and diverse, but most of the current work relies on the assumption of a static graph. This assumption does not hold for many highly dynamic systems, where the underlying connectivity structure is non-stationary and mostly unobserved. Using a static model in these situations may result in sub-optimal performance. In contrast, modeling changes in graph structure over time can provide information about the system whose applications go beyond classification. Most work of this type does not learn effective connectivity and focuses on cross-correlation between nodes to generate undirected graphs. An undirected graph is unable to capture the direction of an interaction, which is vital in many fields, including neuroscience. To bridge this gap, we developed dynamic effective connectivity estimation via neural network training (DECENNT), a novel model to learn an interpretable directed and dynamic graph induced by the downstream classification/prediction task. DECENNT outperforms state-of-the-art (SOTA) methods on five different tasks and infers interpretable task-specific dynamic graphs. The dynamic graphs inferred from functional neuroimaging data align well with the existing literature and provide additional information. Additionally, the temporal attention module of DECENNT identifies time intervals crucial for the predictive downstream task from multivariate time series data.
    Backdoor Detection in Reinforcement Learning. (arXiv:2202.03609v2 [cs.LG] UPDATED)
    As real-world applications of reinforcement learning (RL) become popular, the safety and robustness of RL systems require more attention. A recent work reveals that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL Backdoor Detection, aiming to address this safety vulnerability. An interesting observation we drew from extensive empirical studies is a trigger smoothness property: normal actions similar to the backdoor trigger actions can also trigger low performance of the trojan agent. Inspired by this observation, we propose a reinforcement learning solution, TrojanSeeker, to find approximate trigger actions for the trojan agents, and further propose an efficient approach to mitigate the trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the trojan agents across various types of agents and environments.
    STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence. (arXiv:2201.09857v4 [cs.LG] UPDATED)
    It remains challenging to deploy existing risk-averse approaches in real-world applications. The reasons are multi-fold, including the lack of global optimality guarantees and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term trajectories. Short-term trajectories are more flexible to generate and can avoid the danger of hazardous state visitations. By using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient, with effectiveness comparable to the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging Mujoco robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate that STOPS performs at a state-of-the-art level among existing risk-averse policy search methods.
    BR-NPA: A Non-Parametric High-Resolution Attention Model to improve the Interpretability of Attention. (arXiv:2106.02566v4 [cs.CV] UPDATED)
    The prevalence of attention mechanisms has brought along concerns about the interpretability of attention distributions. Although they provide insight into how a model is operating, using attention as the explanation of model predictions is still highly dubious. The community is still seeking more interpretable strategies for better identifying the local active regions that contribute most to the final decision. To improve the interpretability of existing attention models, we propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures task-relevant, human-interpretable information. The target model is first distilled to have higher-resolution intermediate feature maps. Representative features are then grouped based on local pairwise feature similarity to produce finer-grained, more precise attention maps highlighting task-relevant parts of the input. The obtained attention maps are ranked according to the activity level of the compound feature, which provides information regarding the importance of the highlighted regions. The proposed model can be easily adapted to a wide variety of modern deep models where classification is involved. Extensive quantitative and qualitative experiments showcase more comprehensive and accurate visual explanations compared to state-of-the-art attention models and visualization methods across multiple tasks, including fine-grained image classification, few-shot classification, and person re-identification, without compromising classification accuracy. The proposed visualization model sheds light on how neural networks "pay their attention" differently in different tasks.
    k-strip: A novel segmentation algorithm in k-space for the application of skull stripping. (arXiv:2205.09706v1 [eess.IV])
    Objectives: To present a novel deep learning-based skull stripping algorithm for magnetic resonance imaging (MRI) that works directly in the information-rich k-space. Materials and Methods: Using two datasets from different institutions with a total of 36,900 MRI slices, we trained a deep learning-based model to work directly with the complex raw k-space data. Skull stripping performed by HD-BET (Brain Extraction Tool) in the image domain was used as the ground truth. Results: On both datasets, the results were very similar to the ground truth (DICE scores of 92\%-98\% and Hausdorff distances of under 5.5 mm). Results on slices above the eye region reach DICE scores of up to 99\%, while the accuracy drops in regions around the eyes and below, with partially blurred output. The output of k-strip often smoothed edges at the demarcation to the skull. Binary masks are created with an appropriate threshold. Conclusion: With this proof-of-concept study, we were able to show the feasibility of working in the k-space frequency domain, preserving phase information, with consistent results. Future research should be dedicated to discovering additional ways the k-space can be used for innovative image analysis and further workflows.
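    For background, a small sketch of moving between image space and k-space with the FFT, which is the domain the model operates in (illustrative only; k-strip's network itself is not shown):
```python
import numpy as np

def to_kspace(image):
    """Complex k-space representation of a 2-D image (phase preserved)."""
    return np.fft.fftshift(np.fft.fft2(image))

def from_kspace(kspace):
    """Back to image space; the magnitude gives the viewable image."""
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))

image = np.random.rand(256, 256)   # stand-in for a real MRI slice
k = to_kspace(image)
assert np.allclose(from_kspace(k), image, atol=1e-10)
```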
    Challenges in Deploying Machine Learning: a Survey of Case Studies. (arXiv:2011.09926v3 [cs.LG] UPDATED)
    In recent years, machine learning has transitioned from a field of academic research interest to a field capable of solving real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications, and extracts practical considerations corresponding to stages of the machine learning deployment workflow. By mapping the challenges we found to the steps of the machine learning deployment workflow, we show that practitioners face issues at each stage of the deployment process. The goal of this paper is to lay out a research agenda to explore approaches to addressing these challenges.
    Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition. (arXiv:2203.17066v2 [eess.SP] UPDATED)
    Gesture recognition is one of the most intuitive ways of interaction and has gathered particular attention for human-computer interaction. Radar sensors possess multiple intrinsic properties, such as their ability to work in low illumination and harsh weather conditions while being low-cost and compact, making them highly preferable for a gesture recognition solution. However, most prior work focuses on solutions with a limited range of less than a meter. We propose a novel architecture for a long-range (1 m - 2 m) gesture recognition solution that leverages a point cloud-based cross-learning approach from a camera point cloud to a 60-GHz FMCW radar point cloud, which allows learning better representations while suppressing noise. We use a variant of Dynamic Graph CNN (DGCNN) for the cross-learning, enabling us to model relationships between the points at a local and global level; to model the temporal dynamics, a Bi-LSTM network is employed. In the experimental results section, we demonstrate our model's overall accuracy of 98.4% for five gestures and its generalization capability.
    Heterogeneous Multi-task Learning with Expert Diversity. (arXiv:2106.10595v2 [cs.LG] UPDATED)
    Predicting multiple heterogeneous biological and medical targets is a challenge for traditional deep learning models. In contrast to single-task learning, in which a separate model is trained for each target, multi-task learning (MTL) optimizes a single model to predict multiple related targets simultaneously. To address this challenge, we propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx). Our work aims to tackle the heterogeneous MTL setting, in which the same model optimizes multiple tasks with different characteristics. Such a scenario can overwhelm current MTL approaches due to the challenges in balancing shared and task-specific representations and the need to optimize tasks with competing optimization paths. Our method makes two key contributions: first, we introduce an approach to induce more diversity among experts, thus creating representations more suitable for highly imbalanced and heterogeneous MTL learning; second, we adopt a two-step optimization approach [6, 11] to balance the tasks at the gradient level. We validate our method on three MTL benchmark datasets, including Medical Information Mart for Intensive Care (MIMIC-III) and PubChem BioAssay (PCBA).
    On a class of data-driven mixed-integer programming problems under uncertainty: a distributionally robust approach. (arXiv:2105.14139v3 [math.OC] UPDATED)
    In this study, we analyze linear mixed-integer programming problems in which the distribution of the cost vector is only observable through a finite training data set. In contrast to related studies, we assume that the number of random observations for each component of the cost vector may vary. The goal is then to find a prediction rule that converts the data set into an estimate of the expected value of the objective function, and a prescription rule that provides an associated estimate of the optimal decision. We aim at finding the least conservative prediction and prescription rules that satisfy some specified asymptotic guarantees as the sample size tends to infinity. We demonstrate that under some mild assumptions the resulting vector optimization problems admit a Pareto optimal solution with some attractive theoretical properties. In particular, this solution can be obtained by solving a distributionally robust optimization (DRO) problem with respect to all probability distributions with given component-wise relative entropy distances from the empirical marginal distributions. It turns out that the outlined DRO problem can be solved rather effectively whenever there exists an efficient algorithm for the respective deterministic problem. In addition, we perform numerical experiments in which the out-of-sample performance of the proposed approach is analyzed.
    Scalable Multi-view Clustering with Graph Filtering. (arXiv:2205.09228v1 [cs.LG])
    With the explosive growth of multi-source data, multi-view clustering has attracted great attention in recent years. Most existing multi-view methods operate in raw feature space and heavily depend on the quality of the original feature representation. Moreover, they are often designed for feature data and ignore the rich topological structure information. Accordingly, in this paper, we propose a generic framework to cluster both attribute and graph data with heterogeneous features. It is capable of exploring the interplay between features and structure. Specifically, we first adopt a graph filtering technique to eliminate high-frequency noise and achieve a clustering-friendly smooth representation. To handle the scalability challenge, we develop a novel sampling strategy to improve the quality of anchors. Extensive experiments on attribute and graph benchmarks demonstrate the superiority of our approach with respect to state-of-the-art approaches.
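    A minimal sketch of low-pass graph filtering for a clustering-friendly smooth representation (the generic technique; the paper's anchor-sampling strategy is not shown):
```python
import numpy as np

def smooth_features(A, X, k=2):
    """Apply (I - L_sym/2)^k to node features X, attenuating high-frequency noise."""
    d = A.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    filt = np.eye(len(A)) - 0.5 * L_sym      # low-pass graph filter
    for _ in range(k):
        X = filt @ X
    return X
```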
    Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification. (arXiv:2205.09619v1 [cs.LG])
    The reliability of neural networks is essential for their use in safety-critical applications. Existing approaches generally aim at improving the robustness of neural networks to either real-world distribution shifts (e.g., common corruptions and perturbations, spatial transformations, and natural adversarial examples) or worst-case distribution shifts (e.g., optimized adversarial examples). In this work, we propose the Decision Region Quantification (DRQ) algorithm to improve the robustness of any differentiable pre-trained model against both real-world and worst-case distribution shifts in the data. DRQ analyzes the robustness of local decision regions in the vicinity of a given data point to make more reliable predictions. We theoretically motivate the DRQ algorithm by showing that it effectively smooths spurious local extrema in the decision surface. Furthermore, we propose an implementation using targeted and untargeted adversarial attacks. An extensive empirical evaluation shows that DRQ increases the robustness of adversarially and non-adversarially trained models against real-world and worst-case distribution shifts on several computer vision benchmark datasets.
    Robust and Efficient Medical Imaging with Self-Supervision. (arXiv:2205.09723v1 [cs.CV])
    Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate systems for each clinical setting using site-specific data [1]. However, this quickly becomes impractical as medical data is time-consuming to acquire and expensive to annotate [2]. Thus, the problem of "data-efficient generalization" presents an ongoing difficulty for Medical AI development. Although progress in representation learning shows promise, their benefits have not been rigorously studied, specifically for out-of-distribution settings. To meet these challenges, we present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI. REMEDIS uses a generic combination of large-scale supervised transfer learning with self-supervised learning and requires little task-specific customization. We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data. REMEDIS exhibits significantly improved in-distribution performance with up to 11.5% relative improvement in diagnostic accuracy over a strong supervised baseline. More importantly, our strategy leads to strong data-efficient generalization of medical imaging AI, matching strong supervised baselines using between 1% to 33% of retraining data across tasks. These results suggest that REMEDIS can significantly accelerate the life-cycle of medical imaging AI development thereby presenting an important step forward for medical imaging AI to deliver broad impact.
    Overcoming Language Disparity in Online Content Classification with Multimodal Learning. (arXiv:2205.09744v1 [cs.LG])
    Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, sidelining a majority of the languages spoken globally. While existing research has developed better multilingual and monolingual language models to bridge this language disparity between English and non-English languages, we explore the promise of incorporating the information contained in images via multimodal machine learning. Our comparative analyses on three detection tasks focusing on crisis information, fake news, and emotion recognition, as well as five high-resource non-English languages, demonstrate that: (a) detection frameworks based on pre-trained large language models like BERT and multilingual-BERT systematically perform better on the English language than on non-English languages, and (b) including images via multimodal learning bridges this performance gap. We situate our findings with respect to existing work on the pitfalls of large language models, and discuss their theoretical and practical implications. Resources for this paper are available at https://multimodality-language-disparity.github.io/.
    A Causal Bandit Approach to Learning Good Atomic Interventions in Presence of Unobserved Confounders. (arXiv:2107.02772v2 [cs.LG] UPDATED)
    We study the problem of determining the best intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where the interventions correspond to the arms of the bandit instance. First, we propose a simple regret minimization algorithm that takes as input a semi-Markovian causal graph with atomic interventions and possibly unobservable variables, and achieves $\tilde{O}(\sqrt{M/T})$ expected simple regret, where $M$ is dependent on the input CBN and could be very small compared to the number of arms. We also show that this is almost optimal for CBNs described by causal graphs having an $n$-ary tree structure. Our simple regret minimization results, both upper and lower bound, subsume previous results in the literature, which assumed additional structural restrictions on the input causal graph. In particular, our results indicate that the simple regret guarantee of our proposed algorithm can only be improved by considering more nuanced structural restrictions on the causal graph. Next, we propose a cumulative regret minimization algorithm that takes as input a general causal graph with all observable nodes and atomic interventions and performs better than the optimal MAB algorithm that does not take causal side-information into account. We also experimentally compare both our algorithms with the best known algorithms in the literature. To the best of our knowledge, this work gives the first simple and cumulative regret minimization algorithms for CBNs with general causal graphs under atomic interventions and having unobserved confounders.
    Diverse Weight Averaging for Out-of-Distribution Generalization. (arXiv:2205.09739v1 [cs.CV])
    Standard neural networks struggle to generalize under distribution shifts. For out-of-distribution generalization in computer vision, the best current approach averages the weights along a training run. In this paper, we propose Diverse Weight Averaging (DiWA) that makes a simple change to this strategy: DiWA averages the weights obtained from several independent training runs rather than from a single run. Perhaps surprisingly, averaging these weights performs well under soft constraints despite the network's nonlinearities. The main motivation behind DiWA is to increase the functional diversity across averaged models. Indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between DiWA and standard functional ensembling. Moreover, this decomposition highlights that DiWA succeeds when the variance term dominates, which we show happens when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on the competitive DomainBed benchmark without inference overhead.
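    A minimal sketch of the averaging step (assuming the runs share an architecture so their state_dicts are compatible; run_paths is a hypothetical list of checkpoint files):
```python
import copy
import torch

def average_weights(state_dicts):
    """Parameter-wise mean of several runs' weights. Assumes floating-point
    parameters; integer buffers would need special-casing."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return avg

# usage sketch:
# model.load_state_dict(average_weights([torch.load(p) for p in run_paths]))
```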
    Bayesian Network Structure Learning using Digital Annealer. (arXiv:2006.06926v3 [cs.LG] UPDATED)
    Annealing processors, which solve quadratic unconstrained binary optimization (QUBO) problems, are a potential breakthrough in improving the accuracy of score-based Bayesian network structure learning. However, the bit capacity of current annealing processors is very limited. To utilize the power of annealing processors, it is necessary to encode score-based learning problems into QUBO within the upper bound on bits. In this paper, we propose a novel approach based on the decomposition of candidate parent sets. Experimental results on benchmark networks with $37$ to $223$ variables show that our approach requires fewer bits than the bit capacity of the fourth-generation Fujitsu Digital Annealer, a fully coupled annealing processor developed with semiconductor technology. Moreover, we demonstrate that the Digital Annealer with our conversion method outperforms existing algorithms on some benchmark networks. We expect our approach to promote the utility of annealing processors in Bayesian network structure learning.
    Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. (arXiv:2205.09702v1 [cs.LG])
    Graph neural networks (GNNs) are among the most powerful tools in deep learning. They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy. However, both inference and training of GNNs are complex, and they uniquely combine the features of irregular graph processing with dense and regular computations. This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures. To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining. Then, we use this taxonomy to investigate the amount of parallelism in numerous GNN models, GNN-driven machine learning tasks, software frameworks, or hardware accelerators. We use the work-depth model, and we also assess communication volume and synchronization. We specifically focus on the sparsity/density of the associated tensors, in order to understand how to effectively apply techniques such as vectorization. We also formally analyze GNN pipelining, and we generalize the established Message-Passing class of GNN models to cover arbitrary pipeline depths, facilitating future optimizations. Finally, we investigate different forms of asynchronicity, navigating the path for future asynchronous parallel GNN pipelines. The outcomes of our analysis are synthesized in a set of insights that help to maximize GNN performance, and a comprehensive list of challenges and opportunities for further research into efficient GNN computations. Our work will help to advance the design of future GNNs.
    Flexible Modeling and Multitask Learning using Differentiable Tree Ensembles. (arXiv:2205.09717v1 [cs.LG])
    Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support only a limited number of loss functions and are restricted to single-task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those produced by popular toolkits.
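    A hedged sketch of a single differentiable (soft) tree with tensorized routing (illustrative of soft trees generally; the paper's GPU-oriented formulation differs):
```python
import torch
import torch.nn as nn

class SoftTree(nn.Module):
    def __init__(self, in_dim, depth=3, out_dim=1):
        super().__init__()
        self.depth = depth
        n_internal, n_leaves = 2 ** depth - 1, 2 ** depth
        self.split = nn.Linear(in_dim, n_internal)  # one routing gate per internal node
        self.leaf = nn.Parameter(torch.zeros(n_leaves, out_dim))

    def forward(self, x):
        gate = torch.sigmoid(self.split(x))         # P(go right) at each internal node
        prob = torch.ones(x.size(0), 1, device=x.device)
        idx = 0
        for _ in range(self.depth):                 # propagate path probabilities level by level
            width = prob.size(1)
            g = gate[:, idx: idx + width]
            prob = torch.stack([prob * (1 - g), prob * g], dim=2).flatten(1)
            idx += width
        return prob @ self.leaf                     # leaf-probability-weighted outputs
```
    Because the routing is a product of sigmoids rather than hard comparisons, the whole tree is differentiable end to end and can be trained with any loss via standard first-order optimizers.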
    Spherical Perspective on Learning with Normalization Layers. (arXiv:2006.13382v3 [cs.LG] UPDATED)
    Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, makes it possible to translate the optimization steps onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. First, the first expression for the effective learning rate of Adam is derived. The framework then shows that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, the analysis outlines phenomena on which previous variants of Adam act, and their importance in the optimization process is validated experimentally.
    Provably Precise, Succinct and Efficient Explanations for Decision Trees. (arXiv:2205.09569v1 [cs.AI])
    Decision trees (DTs) embody interpretable classifiers. DTs have been advocated for deployment in high-risk applications, but also for explaining other complex classifiers. Nevertheless, recent work has demonstrated that predictions in DTs ought to be explained with rigorous approaches. Although rigorous explanations can be computed in polynomial time for DTs, their size may be beyond the cognitive limits of human decision makers. This paper investigates the computation of $\delta$-relevant sets for DTs. $\delta$-relevant sets denote explanations that are succinct and provably precise. These sets represent generalizations of rigorous explanations, which are precise with probability one, and so they enable trading off explanation size for precision. The paper proposes two logic encodings for computing smallest $\delta$-relevant sets for DTs. The paper further devises a polynomial-time algorithm for computing $\delta$-relevant sets which are not guaranteed to be subset-minimal, but which the experiments show to be most often subset-minimal in practice. The experimental results also demonstrate the practical efficiency of computing smallest $\delta$-relevant sets.
    Metrics of calibration for probabilistic predictions. (arXiv:2205.09680v1 [math.ST])
    Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose statistically significant discrepancies -- so-called "miscalibration" -- between the predictions and the outcomes. The canonical reliability diagrams histogram the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation is another common practice. But, which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question, by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram as a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph which deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. In particular, metrics based on binning or kernel density estimation unavoidably must trade-off statistical confidence for the ability to resolve variations as a function of the predicted probability or vice versa. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise.
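    A hedged sketch of the cumulative-differences recipe for binary outcomes (the general idea; the paper's scalar metrics and their theory are richer):
```python
import numpy as np

def cumulative_miscalibration(p, y):
    """p: predicted probabilities, y: binary outcomes (0/1).
    Returns the cumulative-differences curve and its maximum absolute deviation;
    a well-calibrated predictor keeps the curve near zero (a flat graph), and
    miscalibration shows up as stretches of steep slope."""
    order = np.argsort(p)                    # sort by predicted probability
    diffs = (y[order] - p[order]) / len(p)   # observed minus expected, scaled
    curve = np.cumsum(diffs)
    return curve, np.abs(curve).max()
```
    Note there is no bin width or kernel bandwidth to choose: the curve is computed directly from the sorted predictions, which is exactly the property the abstract emphasizes.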
    Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks. (arXiv:2205.09683v1 [cs.RO])
    This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend this method, by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation, while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration due to increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and average orientation deviation from 142° to 76° during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
    Neural network topological snake models for locating general phase diagrams. (arXiv:2205.09699v1 [cond-mat.stat-mech])
    Machine learning for locating phase diagrams has received intensive research interest in recent years. However, its application in automatically locating phase diagrams has been limited to a single closed phase boundary. In this paper, in order to locate phase diagrams with multiple phases and complex boundaries, we introduce (i) a network-shaped snake model and (ii) a topologically transformable snake with discriminative cooperative networks, respectively. The phase diagrams of both quantum and classical spin-1 models are obtained. Our method can flexibly determine a phase diagram from just snapshots of configurations from cold-atom or other experiments.
    Semi-WTC: A Practical Semi-supervised Framework for Attack Categorization through Weight-Task Consistency. (arXiv:2205.09669v1 [cs.CR])
    Supervised learning has been widely used for attack detection, but it requires large amounts of high-quality data and labels. However, the data is often imbalanced and sufficient annotations are difficult to obtain. Moreover, these supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. We propose a semi-supervised fine-grained attack categorization framework consisting of an encoder and a two-branch structure to integrate information from labeled and unlabeled data to tackle these practical challenges. This framework can be generalized to different supervised models. A multilayer perceptron with residual connections and batch normalization is used as the encoder to extract features and reduce complexity. The Recurrent Prototype Module (RPM) is proposed to train the encoder effectively in a semi-supervised manner. To alleviate the problem of data imbalance, we introduce Weight-Task Consistency (WTC) into the iterative process of RPM by assigning larger weights in the loss function to classes with fewer samples. In addition, to cope with new attacks in real-world deployment, we further propose an Active Adaption Resampling (AAR) method, which can better discover the distribution of unseen sample data and adapt the parameters of the encoder. Experimental results show that our model outperforms state-of-the-art semi-supervised attack detection methods, with a general 5% improvement in classification accuracy and a 90% reduction in training time.
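    A minimal sketch of the reweighting idea behind WTC: classes with fewer samples get larger weights in the loss (the consistency scheme itself is the paper's):
```python
import torch
import torch.nn as nn

def class_weights(labels, n_classes):
    """Inverse-frequency class weights, normalized to a mean weight of 1."""
    counts = torch.bincount(labels, minlength=n_classes).float().clamp(min=1)
    w = 1.0 / counts
    return w * (n_classes / w.sum())

labels = torch.randint(0, 5, (1000,))            # stand-in training labels
criterion = nn.CrossEntropyLoss(weight=class_weights(labels, 5))
```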
    Discovering Dynamic Functional Brain Networks via Spatial and Channel-wise Attention. (arXiv:2205.09576v1 [cs.CV])
    Using deep learning models to recognize functional brain networks (FBNs) in functional magnetic resonance imaging (fMRI) has been attracting increasing interest recently. However, most existing work focuses on detecting static FBNs from entire fMRI signals, such as correlation-based functional connectivity. Sliding windows are a widely used strategy to capture the dynamics of FBNs, but they are still limited in representing intrinsic functional interactive dynamics at each time step, and the number of FBNs usually needs to be set manually. Moreover, due to the complexity of dynamic interactions in the brain, traditional linear and shallow models are insufficient in identifying complex and spatially overlapped FBNs across each time step. In this paper, we propose a novel Spatial and Channel-wise Attention Autoencoder (SCAAE) for discovering FBNs dynamically. The core idea of SCAAE is to apply an attention mechanism to FBN construction. Specifically, we designed two attention modules: 1) a spatial-wise attention (SA) module to discover FBNs in the spatial domain and 2) a channel-wise attention (CA) module to weigh the channels for selecting the FBNs automatically. We evaluated our approach on the ADHD200 dataset and our results indicate that the proposed SCAAE method can effectively recover the dynamic changes of the FBNs at each fMRI time step, without using sliding windows. More importantly, our proposed hybrid attention modules (SA and CA) do not enforce assumptions of linearity and independence as previous methods do, and thus provide a novel approach to better understanding dynamic functional brain networks.
    Jacobian Granger Causal Neural Networks for Analysis of Stationary and Nonstationary Data. (arXiv:2205.09573v1 [cs.LG])
    Granger causality is a commonly used method for uncovering information flow and dependencies in a time series. Here we introduce JGC (Jacobian Granger Causality), a neural network-based approach to Granger causality using the Jacobian as a measure of variable importance, and propose a thresholding procedure for inferring Granger causal variables using this measure. The resulting approach performs consistently well compared to other approaches in identifying Granger causal variables, the associated time lags, as well as interaction signs. Lastly, through the inclusion of a time variable, we show that this approach is able to learn the temporal dependencies for nonstationary systems whose Granger causal structures change in time.
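    A hedged sketch of Jacobian-based importance for a neural forecaster (illustrative; JGC's thresholding procedure and full model are in the paper):
```python
import torch

def jacobian_importance(model, window):
    """model: maps a lagged window (lags x variables) to a one-step forecast
    (variables,). Returns |d forecast / d input| averaged over forecast
    dimensions, giving one importance score per (lag, variable) pair."""
    J = torch.autograd.functional.jacobian(model, window)  # (vars, lags, vars)
    return J.abs().mean(dim=0)

# usage sketch (forecaster and lagged_window are assumed to exist):
# scores = jacobian_importance(forecaster, lagged_window)
# persistently large entries suggest a Granger-causal influence at that lag.
```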
    Detect Professional Malicious User with Metric Learning in Recommender Systems. (arXiv:2205.09673v1 [cs.IR])
    In e-commerce, online retailers often suffer from professional malicious users (PMUs), who deliberately leave negative reviews and low ratings on their purchased products to extort the retailers for illegal profit. Specifically, there are three challenges for PMU detection: 1) professional malicious users do not conduct any abnormal or illegal interactions (they never leave too many negative reviews and low ratings at the same time), and they adopt masking strategies to disguise themselves, so conventional outlier detection methods are confounded by their masking strategies; 2) the PMU detection model should take both ratings and reviews into consideration, which makes PMU detection a multi-modal problem; 3) there are no public datasets with labels for professional malicious users, which makes PMU detection an unsupervised learning problem. To this end, we propose an unsupervised multi-modal learning model: MMD, which employs Metric learning for professional Malicious users Detection with both ratings and reviews. MMD first utilizes a modified RNN to project each informative review into a sentiment score, which jointly considers the ratings and reviews. Then professional malicious user profiling (MUP) is proposed to capture the sentiment gap between sentiment scores and ratings. MUP filters the users and builds a candidate PMU set. We apply metric learning-based clustering to learn a proper metric matrix for PMU detection. Finally, we utilize this metric and the labeled users to detect PMUs. Specifically, we apply an attention mechanism in metric learning to improve the model's performance. Extensive experiments on four datasets demonstrate that our proposed method can solve this unsupervised detection problem. Moreover, the performance of state-of-the-art recommender models is enhanced by taking MMD as a preprocessing stage.
    Are Graph Representation Learning Methods Robust to Graph Sparsity and Asymmetric Node Information?. (arXiv:2205.09648v1 [cs.LG])
    The growing popularity of Graph Representation Learning (GRL) methods has resulted in the development of a large number of models applied to a miscellany of domains. Behind this diversity of domains, there is a strong heterogeneity of graphs, making it difficult to estimate the expected performance of a model on a new graph, especially when the graph has distinctive characteristics that have not been encountered in the benchmarks yet. To address this, we have developed an experimental pipeline to assess the impact of a given property on model performance. In this paper, we use this pipeline to study the effect of two specificities encountered in banks' transactional graphs, resulting from the partial view a bank has of all the individuals and transactions carried out on the market. These specific features are graph sparsity and asymmetric node information. This study demonstrates the robustness of GRL methods to these distinctive characteristics. We believe that this work can ease the evaluation of GRL methods on specific characteristics and foster the development of such methods on transactional graphs.
    HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering. (arXiv:2205.09721v1 [cs.LG])
    The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is "more" tree-like, when evaluated in terms of Gromov's $\delta$ hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94\%$.
    Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow. (arXiv:2205.09703v1 [cs.LG])
    In modeling time series data, we often need to augment the existing data records to increase modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, techniques that could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to similar models with only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
    What killed the Convex Booster ?. (arXiv:2205.09628v1 [cs.LG])
    A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high-precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call on the half-century-old founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results, and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem, as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: parameterisation.
    EXACT: How to Train Your Accuracy. (arXiv:2205.09615v1 [cs.LG])
    Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework that introduces stochasticity to a model's output and optimizes expected accuracy, i.e., the accuracy of the stochastic model. Extensive experiments on image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
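    A hedged sketch of one way to optimize the accuracy of a stochastic model: Monte Carlo over reparameterized Gaussian logit noise, with a sharp softmax as a differentiable surrogate for the argmax indicator (an illustration of the idea, not EXACT's estimator):
```python
import torch

def expected_accuracy_loss(logits, y, sigma=1.0, samples=32, temp=0.1):
    """logits: (B, C), y: (B,) integer labels. Returns 1 - (approximate)
    expected accuracy of the noisy classifier, which is differentiable."""
    noise = sigma * torch.randn(samples, *logits.shape, device=logits.device)
    noisy = (logits.unsqueeze(0) + noise) / temp     # sharp softmax ~ argmax
    idx = torch.arange(len(y), device=logits.device)
    p_correct = torch.softmax(noisy, dim=-1)[..., idx, y]
    return 1.0 - p_correct.mean()
```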
    Wojood: Nested Arabic Named Entity Recognition Corpus. (arXiv:2205.09651v1 [cs.CL])
    This paper presents Wojood, a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. Wojood consists of about 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types, including person, organization, location, event and date. More importantly, the corpus is annotated with nested entities instead of the more common flat annotations. The data contains about 75K entities, 22.5% of which are nested. The inter-annotator evaluation of the corpus demonstrated a strong agreement, with Cohen's Kappa of 0.979 and an F1-score of 0.976. To validate our data, we used the corpus to train a nested NER model based on multi-task learning and AraBERT (Arabic BERT). The model achieved an overall micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
    The AI Mechanic: Acoustic Vehicle Characterization Neural Networks. (arXiv:2205.09667v1 [cs.SD])
    In a world increasingly dependent on road-based transportation, it is essential to understand vehicles. We introduce the AI mechanic, an acoustic vehicle characterization deep learning system, as an integrated approach using sound captured from mobile devices to enhance transparency and understanding of vehicles and their condition for non-expert users. We develop and implement novel cascading architectures for vehicle understanding, which we define as sequential, conditional, multi-level networks that process raw audio to extract highly-granular insights. To showcase the viability of cascading architectures, we build a multi-task convolutional neural network that predicts and cascades vehicle attributes to enhance fault detection. We train and test these models on a synthesized dataset reflecting more than 40 hours of augmented audio and achieve >92% validation set accuracy on attributes (fuel type, engine configuration, cylinder count and aspiration type). Our cascading architecture additionally achieved 93.6% validation and 86.8% test set accuracy on misfire fault prediction, demonstrating margins of 16.4% / 7.8% and 4.2% / 1.5% improvement over naïve and parallel baselines. We explore experimental studies focused on acoustic features, data augmentation, feature fusion, and data reliability. Finally, we conclude with a discussion of broader implications, future directions, and application areas for this work.
    ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD. (arXiv:2205.09685v1 [cs.CL])
    Using pre-trained transformer models such as BERT has proven to be effective in many NLP tasks. This paper presents our work to fine-tune BERT models for Arabic Word Sense Disambiguation (WSD). We treated the WSD task as a sentence-pair binary classification task. First, we constructed a dataset of labeled Arabic context-gloss pairs (~167k pairs) extracted from the Arabic Ontology and the large lexicographic database available at Birzeit University. Each pair was labeled as True or False, and target words in each context were identified and annotated. Second, we used this dataset to fine-tune three pre-trained Arabic BERT models. Third, we experimented with different supervised signals used to emphasize target words in context. Our experiments achieved promising results (accuracy of 84%) even though we used a large set of senses in the experiment.
    A Topological Approach for Semi-Supervised Learning. (arXiv:2205.09617v1 [cs.CV])
    Nowadays, Machine Learning and Deep Learning methods have become the state-of-the-art approach to solve data classification tasks. In order to use those methods, it is necessary to acquire and label a considerable amount of data; however, this is not straightforward in some fields, since data annotation is time consuming and might require expert knowledge. This challenge can be tackled by means of semi-supervised learning methods that take advantage of both labelled and unlabelled data. In this work, we present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA), a field that is gaining importance for analysing large amounts of data with high variety and dimensionality. In particular, we have created two semi-supervised learning methods following two different topological approaches. In the former, we have used a homological approach that consists of studying the persistence diagrams associated with the data using the Bottleneck and Wasserstein distances. In the latter, we have taken into account the connectivity of the data. In addition, we have carried out a thorough analysis of the developed methods using 3 synthetic datasets, 5 structured datasets, and 2 datasets of images. The results show that the semi-supervised methods developed in this work outperform both models trained with only manually labelled data and classical semi-supervised learning methods, reaching improvements of up to 16%.
    Data Valuation for Offline Reinforcement Learning. (arXiv:2205.09550v1 [cs.LG])
    The success of deep reinforcement learning (DRL) hinges on the availability of training data, which is typically obtained via a large number of environment interactions. In many real-world scenarios, costs and risks are associated with gathering these data. The field of offline reinforcement learning addresses these issues by outsourcing the collection of data to a domain expert or a carefully monitored program and subsequently searching for a batch-constrained optimal policy. With the emergence of data markets, an alternative to constructing a dataset in-house is to purchase external data. However, while state-of-the-art offline reinforcement learning approaches have shown a lot of promise, they currently rely on carefully constructed datasets that are well aligned with the intended target domains. This raises questions regarding the transferability and robustness of an offline reinforcement learning agent trained on externally acquired data. In this paper, we empirically evaluate the ability of the current state-of-the-art offline reinforcement learning approaches to cope with source-target domain mismatch within two MuJoCo environments, finding that current state-of-the-art offline reinforcement learning algorithms underperform in the target domain. To address this, we propose data valuation for offline reinforcement learning (DVORL), which allows us to identify relevant and high-quality transitions, improving the performance and transferability of policies learned by offline reinforcement learning algorithms. The results show that our method outperforms offline reinforcement learning baselines on two MuJoCo environments.
    Focused Adversarial Attacks. (arXiv:2205.09624v1 [cs.LG])
    Recent advances in machine learning show that neural models are vulnerable to minimally perturbed inputs, or adversarial examples. Adversarial algorithms are optimization problems that minimize the accuracy of ML models by perturbing inputs, often using a model's loss function to craft such perturbations. State-of-the-art object detection models are characterized by very large output manifolds due to the number of possible locations and sizes of objects in an image. This leads to their outputs being sparse and optimization problems that use them incur a lot of unnecessary computation. We propose to use a very limited subset of a model's learned manifold to compute adversarial examples. Our \textit{Focused Adversarial Attacks} (FA) algorithm identifies a small subset of sensitive regions to perform gradient-based adversarial attacks. FA is significantly faster than other gradient-based attacks when a model's manifold is sparsely activated. Also, its perturbations are more efficient than other methods under the same perturbation constraints. We evaluate FA on the COCO 2017 and Pascal VOC 2007 detection datasets.
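    A minimal sketch of the attack pattern, assuming the set of sensitive regions has already been identified and is given as a binary mask (the paper's region-selection procedure is not reproduced here): a gradient-based, FGSM-style perturbation applied only inside the mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def focused_fgsm(model, x, y, mask, eps=8 / 255):
    """One-step gradient attack restricted to a binary region mask."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Perturb only pixels flagged as sensitive; leave the rest untouched.
    x_adv = x + eps * x.grad.sign() * mask
    return x_adv.clamp(0, 1).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # toy classifier
x = torch.rand(2, 3, 8, 8)
y = torch.tensor([1, 7])
mask = torch.zeros_like(x)
mask[:, :, 2:6, 2:6] = 1.0        # hypothetical "sensitive region"
x_adv = focused_fgsm(model, x, y, mask)
```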
    How catastrophic can catastrophic forgetting be in linear regression?. (arXiv:2205.09588v1 [cs.LG])
    To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when $T$ tasks in $d$ dimensions are presented cyclically for $k$ iterations, we prove an upper bound of $T^2 \cdot \min\{1/\sqrt{k}, d/k\}$ on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the $T^2$ factor can be lifted when tasks are presented in a random ordering.
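    A toy simulation of this setting (an illustration of the alternating-projections view, not the paper's code): overparameterized linear regression with tasks presented cyclically, each task trained to convergence, i.e., an exact projection onto its solution set.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T, k = 50, 5, 4, 100            # dimension, rows per task, tasks, cycles
w_star = rng.normal(size=d)
tasks = []
for _ in range(T):
    X = rng.normal(size=(n, d))       # each task has its own input distribution
    tasks.append((X, X @ w_star))     # noiseless labels from a shared w_star

w = np.zeros(d)
for _ in range(k):
    for X, y in tasks:
        # Train task to convergence: exact projection onto {w : Xw = y}
        # (a block Kaczmarz / alternating-projection step).
        w += X.T @ np.linalg.solve(X @ X.T, y - X @ w)

forgetting = np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks])
print(f"mean squared forgetting after {k} cycles: {forgetting:.3e}")
```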
    Disentangling Active and Passive Cosponsorship in the U.S. Congress. (arXiv:2205.09674v1 [cs.LG])
    In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.
    Smooth densities and generative modeling with unsupervised random forests. (arXiv:2205.09435v1 [stat.ML])
    Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.
    IFTT-PIN: A PIN-Entry Method Leveraging the Self-Calibration Paradigm. (arXiv:2205.09534v1 [cs.HC])
    IFTT-PIN is a self-calibrating version of the PIN-entry method introduced in Roth et al. (2004) [1]. In [1], digits are split into two sets, each assigned a color. To communicate their digit, users press the button with the same color that is assigned to their digit, which can thus be identified by elimination after a few iterations. IFTT-PIN uses the same principle but does not pre-assign colors to each button. Instead, users are free to choose which button to use for each color. The button-to-color mapping only exists in the user's mind and is never directly communicated to the interface. In other words, IFTT-PIN infers both the user's PIN and their preferred button-to-color mapping at the same time, a process called self-calibration. In this paper, we present online interactive demonstrations of IFTT-PIN (available at https://github.com/jgrizou/IFTT-PIN), with and without self-calibration, and introduce the key concepts and assumptions making self-calibration possible. We review related work in the field of brain-computer interface and further propose self-calibration as a novel approach to protect users against shoulder surfing attacks. Finally, we introduce a vault cracking challenge as a test of usability and security that was informally tested at our institute. With IFTT-PIN, we wish to demonstrate a new interactive experience where users can decide actively and on-the-fly how to use an interface. The self-calibration paradigm might lead to novel opportunities for interaction in other applications or domains. We hope this work will inspire the community to invent them.
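    A minimal sketch of the self-calibration principle (the two-button setup and all names below are assumptions for illustration): jointly infer the user's digit and their private button-to-color mapping by eliminating hypotheses inconsistent with the observed presses.

```python
import itertools

DIGITS = range(10)
COLORS = ("yellow", "grey")
BUTTONS = ("left", "right")
# A hypothesis pairs the user's digit with a private button-to-color mapping.
MAPPINGS = [dict(zip(BUTTONS, perm)) for perm in itertools.permutations(COLORS)]
hypotheses = {(d, m) for d in DIGITS for m in range(len(MAPPINGS))}

def eliminate(hypotheses, partition, button):
    """Keep hypotheses consistent with pressing `button` when digit d
    was shown color partition[d] this round."""
    return {(d, m) for (d, m) in hypotheses
            if MAPPINGS[m][button] == partition[d]}

# Example round: even digits colored yellow, odd digits grey; user presses "left".
partition = {d: "yellow" if d % 2 == 0 else "grey" for d in DIGITS}
hypotheses = eliminate(hypotheses, partition, "left")
print(len(hypotheses))  # 10 of 20 remain; further rounds narrow both unknowns
```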
    Data-driven prediction of Air Traffic Controllers reactions to resolving conflicts. (arXiv:2205.09539v1 [cs.AI])
    With the aim of enhancing automation in conflict detection and resolution (CD&R) tasks in the Air Traffic Management domain, in this paper we propose deep learning (DL) techniques that learn models of Air Traffic Controllers' (ATCO) reactions in resolving conflicts that may violate separation minimum constraints among aircraft trajectories: this entails learning both when and how the ATCO will react towards resolving a conflict. We focus on timely reactions, i.e., predicting the trajectory points, as the trajectory evolves, at which the ATCO issues a conflict resolution action, while also predicting the type of resolution action (if any). Towards this goal, the paper formulates the ATCO reaction prediction problem for CD&R, presents DL methods that model ATCO timely reactions, and evaluates these methods on real-world data sets, showing their efficacy in prediction with very high accuracy.
    Parallel bandit architecture based on laser chaos for reinforcement learning. (arXiv:2205.09543v1 [cs.ET])
    Accelerating artificial intelligence by photonics is an active field of study aiming to exploit the unique properties of photons. Reinforcement learning is an important branch of machine learning, and photonic decision-making principles have been demonstrated with respect to multi-armed bandit problems. However, reinforcement learning can involve a massive number of states, unlike previously demonstrated bandit problems where the number of states is only one. Q-learning is a well-known approach in reinforcement learning that can deal with many states. The architecture of Q-learning, however, does not fit photonic implementations well due to its separation of the update rule and action selection. In this study, we organize a new architecture for multi-state reinforcement learning as a parallel array of bandit problems in order to benefit from photonic decision-makers, which we call the parallel bandit architecture for reinforcement learning, or PBRL for short. Taking a cart-pole balancing problem as an instance, we demonstrate that PBRL adapts to the environment in fewer time steps than Q-learning. Furthermore, PBRL yields faster adaptation when operated with a chaotic laser time series than with uniformly distributed pseudorandom numbers, where the autocorrelation inherent in the laser chaos provides a positive effect. We also find that the variety of states that the system undergoes during the learning phase exhibits completely different properties between PBRL and Q-learning. The insights obtained through the present study are also beneficial for existing computing platforms, not just photonic realizations, in accelerating performance via the PBRL algorithm and correlated random sequences.
    Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters. (arXiv:2205.09470v1 [cs.LG])
    The ever-growing model size and scale of compute have attracted increasing interest in training deep learning models over multiple nodes. However, when it comes to training on cloud clusters, especially across remote clusters, significant challenges arise. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters connected by low-bandwidth wide area networks (WANs). We take natural language processing (NLP) as an example to show how Nebula-I works in different training phases, which include: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which together cover the most popular paradigm of recent deep learning. To balance accuracy and communication efficiency, Nebula-I jointly applies parameter-efficient training strategies, hybrid parallel computing methods and adaptive communication acceleration techniques. Meanwhile, security strategies are employed to guarantee safety, reliability and privacy in intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g. GPU and NPU. Experiments demonstrate that the proposed framework substantially improves training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimal development effort, and the utility of existing large pre-trained models can be further promoted. We also report new state-of-the-art results on cross-lingual natural language inference tasks, generated with a novel learning framework and Nebula-I.
    The Impact of COVID-19 Pandemic on LGBTQ Online Communities. (arXiv:2205.09511v1 [cs.SI])
    The COVID-19 pandemic has disproportionately impacted the lives of minorities, such as members of the LGBTQ community (lesbian, gay, bisexual, transgender, and queer) due to pre-existing social disadvantages and health disparities. Although extensive research has been carried out on the impact of the COVID-19 pandemic on different aspects of the general population's lives, few studies are focused on the LGBTQ population. In this paper, we identify a group of Twitter users who self-disclose to belong to the LGBTQ community. We develop and evaluate two sets of machine learning classifiers using a pre-pandemic and a during-pandemic dataset to identify Twitter posts exhibiting minority stress, which is a unique pressure faced by the members of the LGBTQ population due to their sexual and gender identities. For this task, we collect a set of 20,593,823 posts by 7,241 self-disclosed LGBTQ users and annotate a randomly selected subset of 2800 posts. We demonstrate that our best pre-pandemic and during-pandemic models show strong and stable performance for detecting posts that contain minority stress. We investigate the linguistic differences in minority stress posts across pre- and during-pandemic periods. We find that anger words are strongly associated with minority stress during the COVID-19 pandemic. We explore the impact of the pandemic on the emotional states of the LGBTQ population by conducting controlled comparisons with the general population. We adopt propensity score-based matching to perform a causal analysis. The results show that the LGBTQ population has a greater increase in the usage of cognitive words and a worsening in the usage of positive emotion words compared to the group of the general population with similar pre-pandemic behavioral attributes.
    Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes. (arXiv:2205.09391v1 [cs.CL])
    Data augmentation methods for Natural Language Processing tasks have been explored in recent years; however, they remain limited, and it is hard to capture diversity at the sentence level. Besides, it is not always possible to perform data augmentation on supervised tasks. To address these problems, we propose a neural data augmentation method, which is a combination of a Conditional Variational Autoencoder and an encoder-decoder Transformer model. While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language together with its class condition. Following developments in pre-trained language models over the past years, we train and evaluate our models on several benchmarks to strengthen downstream tasks. We compare our method with 3 different augmentation techniques. The presented results show that our model increases the performance of current models compared to other data augmentation techniques with a small amount of computational power.
    Action Conditioned Tactile Prediction: a case study on slip prediction. (arXiv:2205.09430v1 [cs.RO])
    Tactile predictive models can be useful across several robotic manipulation tasks, e.g. robotic pushing, robotic grasping, slip avoidance, and in-hand manipulation. However, available tactile prediction models have mostly been studied for image-based tactile sensors, and there is no comparison study indicating the best performing models. In this paper, we present two novel data-driven action-conditioned models for predicting tactile signals during real-world physical robot interaction tasks: (1) an action-conditioned tactile prediction model and (2) an action-conditioned tactile-video prediction model. We use a magnetic-based tactile sensor, which is challenging to analyse, to test state-of-the-art predictive models and the only existing bespoke tactile prediction model, and we compare the performance of these models with that of our proposed models. We perform the comparison study using our novel tactile-enabled dataset containing 51,000 tactile frames of a real-world robotic manipulation task with 11 flat-surfaced household objects. Our experimental results demonstrate the superiority of our proposed tactile prediction models in terms of qualitative, quantitative and slip prediction scores.
    A Boosting Algorithm for Positive-Unlabeled Learning. (arXiv:2205.09485v1 [cs.LG])
    Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many PU methods based on linear models and neural networks have been proposed; however, there is still little study of how theoretically sound boosting-style algorithms could work with P and U data. Considering that in some scenarios neural networks cannot perform as well as boosting algorithms even with fully-supervised data, we propose a novel boosting algorithm for PU learning, Ada-PU, and compare it against neural networks. Ada-PU follows the general procedure of AdaBoost while two different distributions of P data are maintained and updated. After a weak classifier is learned on the newly updated distribution, the corresponding combining weight for the final ensemble is estimated using only PU data. We demonstrate that, with a smaller set of base classifiers, the proposed method is guaranteed to keep the theoretical properties of the boosting algorithm. In experiments, we show that Ada-PU outperforms neural networks on benchmark PU datasets. We also study UNSW-NB15, a real-world dataset in cyber security, and demonstrate that Ada-PU has superior performance for malicious activity detection.
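    Ada-PU's specific weight updates are not detailed in the abstract. As background, here is a minimal sketch of the standard unbiased PU risk estimator (du Plessis et al.) that PU methods of this kind evaluate classifiers with, using only P and U samples and a class prior pi assumed known; the scoring rule and data below are made up for illustration.

```python
import numpy as np

def zero_one(margin):
    """0-1 loss on signed margins: 1 where the sign is wrong."""
    return (margin <= 0).astype(float)

def unbiased_pu_risk(f, x_p, x_u, pi):
    """R(f) ~= pi*E_P[l(f,+1)] + E_U[l(f,-1)] - pi*E_P[l(f,-1)]."""
    return (pi * zero_one(f(x_p)).mean()      # positives labeled +1
            + zero_one(-f(x_u)).mean()        # unlabeled treated as -1 ...
            - pi * zero_one(-f(x_p)).mean())  # ... with the P-part debiased

rng = np.random.default_rng(0)
x_p = rng.normal(loc=+1.0, size=(100, 2))     # positive samples
x_u = rng.normal(loc=0.0, size=(500, 2))      # unlabeled mixture
f = lambda x: x.sum(axis=1)                   # a weak scoring rule
print(unbiased_pu_risk(f, x_p, x_u, pi=0.3))
```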
    Predictive Maintenance using Machine Learning. (arXiv:2205.09402v1 [cs.LG])
    Predictive maintenance (PdM) is a concept which is implemented to effectively manage maintenance plans of assets by predicting their failures with data-driven techniques. In these scenarios, data is collected over a certain period of time to monitor the state of equipment. The objective is to find correlations and patterns that can help predict and ultimately prevent failures. Equipment in the manufacturing industry is often utilized without a planned maintenance approach. Such practice frequently results in unexpected downtime owing to unexpected failures. In scheduled maintenance, the condition of the manufacturing equipment is checked after a fixed time interval, and if any fault occurs, the component is replaced to avoid unexpected equipment stoppages. On the flip side, this increases the time for which the machine is non-functioning and the cost of carrying out the maintenance. The emergence of Industry 4.0 and smart systems has led to increasing emphasis on predictive maintenance (PdM) strategies that can reduce the cost of downtime and increase the availability (utilization rate) of manufacturing equipment. PdM also has the potential to bring about new sustainable practices in manufacturing by fully utilizing the useful lives of components.
    Simple Regularisation for Uncertainty-Aware Knowledge Distillation. (arXiv:2205.09526v1 [cs.LG])
    Considering uncertainty estimation of modern neural networks (NNs) is one of the most important steps towards deploying machine learning systems in meaningful real-world applications such as medicine, finance or autonomous systems. At the moment, ensembles of different NNs constitute the state of the art in both accuracy and uncertainty estimation in different tasks. However, ensembles of NNs are impractical under real-world constraints, since their computation and memory consumption scale linearly with the size of the ensemble, which increases their latency and deployment cost. In this work, we examine a simple regularisation approach for distribution-free knowledge distillation of an ensemble of machine learning models into a single NN. The aim of the regularisation is to preserve the diversity, accuracy and uncertainty estimation characteristics of the original ensemble without any intricacies, such as fine-tuning. We demonstrate the generality of the approach on combinations of toy data, SVHN/CIFAR-10, simple to complex NN architectures and different tasks.
    Threshold Designer Adaptation: Improved Adaptation for Designers in Co-creative Systems. (arXiv:2205.09269v1 [cs.LG])
    To best assist human designers with different styles, Machine Learning (ML) systems need to be able to adapt to them. However, there has been relatively little prior work on how and when to best adapt an ML system to a co-designer. In this paper we present threshold designer adaptation: a novel method for adapting a creative ML model to an individual designer. We evaluate our approach with a human subject study using a co-creative rhythm game design tool. We find that designers prefer our proposed method and produce higher quality content in comparison to an existing baseline.
    CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies. (arXiv:2205.09433v1 [cs.LG])
    Reinforcement Learning has drawn huge interest as a tool for solving optimal control problems. Solving a given problem (task or environment) involves converging towards an optimal policy. However, there might exist multiple optimal policies that can dramatically differ in their behaviour; for example, some may be faster than others but at the expense of greater risk. We consider and study a distribution of optimal policies. We design a curiosity-augmented Metropolis algorithm (CAMEO), such that we can sample optimal policies, and such that these policies effectively adopt diverse behaviours, since this implies greater coverage of the different possible optimal policies. In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems, even in the challenging case of environments that provide sparse rewards. We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability, and represent a first step towards learning the distribution of optimal policies itself.
    An Approach to Investigate Public Opinion, Views, and Perspectives Towards Exoskeleton Technology. (arXiv:2205.09151v1 [cs.HC])
    Over the last decade, exoskeletons have had an extensive impact on different disciplines and application domains such as assisted living, military, healthcare, firefighting, and industries, on account of their diverse and dynamic functionalities to augment human abilities, stamina, potential, and performance in a multitude of ways. In view of this wide-scale applicability and use-cases of exoskeletons, it is crucial to investigate and analyze the public opinion, views, and perspectives towards exoskeletons, which would help to interpret the effectiveness of the underlying human-robot, human-machine, and human-technology interactions. The Internet of Everything era of today's living, characterized by people spending more time on the internet than ever before, holds the potential for the investigation of the same by mining and analyzing relevant web behavior, specifically from social media, that can be interpreted to understand public opinion, views, and perspectives towards a topic or set of topics. Therefore, this paper aims to address this research challenge related to exoskeletons by utilizing the potential of web behavior-based Big Data mining in the modern-day Internet of Everything era. As Twitter is one of the most popular social media platforms on a global scale - characterized by both the number of users and the amount of time spent by its users on the platform - this work focused on investigating web behavior on Twitter to interpret the public opinion, views, and perspectives towards exoskeleton technology. A total of approximately 20,000 tweets related to exoskeletons were used to evaluate the effectiveness of the proposed approach. The results presented and discussed uphold the efficacy of the proposed approach to interpret and analyze the public opinion, views, and perspectives towards exoskeletons from the associated tweets.
    COVID-19 Monitoring System using Social Distancing and Face Mask Detection on Surveillance video datasets. (arXiv:2110.03905v2 [cs.CV] UPDATED)
    In the current times, the fear and danger of the COVID-19 virus still looms large. Manual monitoring of social distancing norms is impractical with a large population moving about and with an insufficient task force and resources to administer it. There is a need for a lightweight, robust and 24x7 video-monitoring system that automates this process. This paper proposes a comprehensive and effective solution to perform person detection, social distancing violation detection, face detection and face mask classification using object detection, clustering and a Convolutional Neural Network (CNN) based binary classifier. For this, YOLOv3, Density-based spatial clustering of applications with noise (DBSCAN), Dual Shot Face Detector (DSFD) and a MobileNetV2 based binary classifier have been employed on surveillance video datasets. This paper also provides a comparative study of different face detection and face mask classification models. Finally, a video dataset labelling method is proposed, along with the labelled video dataset, to compensate for the lack of datasets in the community, and it is used for evaluation of the system. The system performance is evaluated in terms of accuracy, F1 score as well as the prediction time, which has to be low for practical applicability. The system performs with an accuracy of 91.2% and an F1 score of 90.79% on the labelled video dataset and has an average prediction time of 7.12 seconds for 78 frames of a video.
    Turbulent field fluctuations in gyrokinetic and fluid plasmas. (arXiv:2107.09744v2 [physics.plasm-ph] CROSS LISTED)
    A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is the validation in accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades simulated boundary plasmas in experiment, but significant questions exist regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first ever direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.
    Federated Learning: Applications, Challenges and Future Scopes. (arXiv:2205.09513v1 [cs.LG])
    Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL addresses these privacy concerns via a shared global deep learning (DL) model coordinated by a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection, communication cost, system heterogeneity, and unreliable model upload, followed by future research directions.
    Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking. (arXiv:2205.09638v1 [cs.IR])
    In information retrieval (IR), candidate set pruning has been commonly used to speed up two-stage relevance ranking. However, such an approach lacks accurate error control and often trades accuracy off against computational efficiency in an empirical fashion, lacking theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to be controlled under a user-specified threshold with high probability. Both in-domain and out-of-domain experiments show that our method successfully prunes the first-stage retrieved candidate sets to improve the second-stage reranking speed while satisfying the pre-specified accuracy constraints in both settings. For example, on MS MARCO Passage v1, our method yields an average candidate set size of 27 out of 1,000 which increases the reranking speed by about 37 times, while the MRR@10 is greater than a pre-specified value of 0.38 with about 90% empirical coverage and the empirical baselines fail to provide such guarantee. Code and data are available at: https://github.com/alexlimh/CEC-Ranking.
    On the Convergence of Policy in Unregularized Policy Mirror Descent. (arXiv:2205.08176v2 [math.OC] UPDATED)
    In this short note, we give a convergence analysis of the policy in the recently popular policy mirror descent (PMD). We mainly consider the unregularized setting following [11] with generalized Bregman divergence. The difference is that we directly give the convergence rates of the policy under generalized Bregman divergence. Our results are inspired by the convergence of the value function in previous works and extend the study of policy mirror descent. Though some results have already appeared in previous work, we further discover that a large class of Bregman divergences gives finite-step convergence to an optimal policy, such as the classical Euclidean distance.
    Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization. (arXiv:2205.09709v1 [eess.AS])
    The majority of speech signals across different scenarios are never available with well-defined audio segments containing only a single speaker. A typical conversation between two speakers consists of segments where their voices overlap, where they interrupt each other, or where they halt their speech between multiple sentences. Recent advancements in diarization technology leverage neural network-based approaches to improve multiple subsystems of the speaker diarization system, comprising segment-wise embedding feature extraction and detection of speaker changes during conversation. However, to identify speakers through clustering, models depend on methodologies like PLDA to generate a similarity measure between two extracted segments from a given conversational audio. Since these algorithms ignore the temporal structure of conversations, they tend to achieve a higher Diarization Error Rate (DER), thus leading to misdetections in terms of both speaker and change identification. Therefore, to compare the similarity of two speech segments both independently and sequentially, we propose a Bi-directional Long Short-term Memory network for estimating the elements of the similarity matrix. Once the similarity matrix is generated, Agglomerative Hierarchical Clustering (AHC) is applied to further identify speaker segments based on thresholding. To evaluate the performance, the Diarization Error Rate (DER%) metric is used. The proposed model achieves a low DER of 34.80% on a test set of audio samples derived from the ICSI Meeting Corpus, compared to the traditional PLDA-based similarity measurement mechanism, which achieved a DER of 39.90%.
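    A minimal sketch of the final clustering stage (the Bi-LSTM scorer itself is not reproduced; the linkage method and threshold below are assumptions): convert the similarity matrix to distances and apply AHC with a distance threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_from_similarity(sim, threshold=0.5):
    """Cluster segments from a pairwise similarity matrix via AHC."""
    sim = (sim + sim.T) / 2                     # enforce symmetry
    dist = 1.0 - sim                            # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # condensed form for linkage
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=threshold, criterion="distance")  # speaker labels

sim = np.array([[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.1],
                [0.2, 0.1, 1.0]])
print(ahc_from_similarity(sim))  # segments 0 and 1 grouped as one speaker
```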
    Can language models learn from explanations in context?. (arXiv:2204.02329v2 [cs.CL] UPDATED)
    Large language models can perform new tasks by adapting to a few in-context examples. For humans, rapid learning from examples can benefit from explanations that connect examples to task principles. We therefore investigate whether explanations of few-shot examples can allow language models to adapt more effectively. We annotate a set of 40 challenging tasks from BIG-Bench with explanations of answers to a small subset of questions, as well as a variety of matched control explanations. We evaluate the effects of various zero-shot and few-shot prompts that include different types of explanations, instructions, and controls on the performance of a range of large language models. We analyze these results using statistical multilevel modeling techniques that account for the nested dependencies among conditions, tasks, prompts, and models. We find that explanations of examples can improve performance. Adding untuned explanations to a few-shot prompt offers a modest improvement in performance; about 1/3 the effect size of adding few-shot examples, but twice the effect size of task instructions. We then show that explanations tuned for performance on a small validation set offer substantially larger benefits; building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone. Hand-tuning explanations can substantially improve performance on challenging tasks. Furthermore, even untuned explanations outperform carefully matched controls, suggesting that the benefits are due to the link between an example and its explanation, rather than lower-level features of the language used. However, only large models can benefit from explanations. In summary, explanations can support the in-context learning abilities of large language models on challenging tasks.
    Diagonal State Spaces are as Effective as Structured State Spaces. (arXiv:2203.14343v3 [cs.LG] UPDATED)
    Modeling long range dependencies in sequential data is a fundamental step towards attaining human-level performance in many modalities such as text, vision, audio and video. While attention-based models are a popular and effective choice in modeling short-range interactions, their performance on tasks requiring long range reasoning has been largely inadequate. In an exciting result, Gu et al. (ICLR 2022) proposed the $\textit{Structured State Space}$ (S4) architecture delivering large gains over state-of-the-art models on several long-range tasks across various modalities. The core proposition of S4 is the parameterization of state matrices via a diagonal plus low rank structure, allowing efficient computation. In this work, we show that one can match the performance of S4 even without the low rank correction, thus assuming the state matrices to be diagonal. Our $\textit{Diagonal State Space}$ (DSS) model matches the performance of S4 on Long Range Arena tasks and on speech classification on the Speech Commands dataset, while being conceptually simpler and more straightforward to implement.
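    A minimal NumPy sketch of the core computational idea under a simplified discretization (not the paper's exact parameterization): with a diagonal state matrix, the SSM's convolution kernel is just a sum of damped complex exponentials, computable without any low-rank correction.

```python
import numpy as np

def diagonal_ssm_kernel(lmbda, B, C, dt, L):
    """Kernel K[l] = Re( sum_n C_n * exp(lmbda_n * dt)^l * B_n * dt )."""
    decay = np.exp(lmbda * dt)                         # (N,) complex modes
    powers = decay[None, :] ** np.arange(L)[:, None]   # (L, N) Vandermonde
    return (powers @ (C * B * dt)).real                # (L,) real kernel

N, L = 16, 256
rng = np.random.default_rng(0)
lmbda = -0.5 + 1j * np.pi * np.arange(N)               # stable diagonal modes
B = rng.normal(size=N) + 0j
C = rng.normal(size=N) + 0j
K = diagonal_ssm_kernel(lmbda, B, C, dt=0.01, L=L)
u = rng.normal(size=L)
y = np.convolve(u, K)[:L]                              # causal SSM output
```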
    TourBERT: A pretrained language model for the tourism industry. (arXiv:2201.07449v3 [cs.CL] UPDATED)
    The Bidirectional Encoder Representations from Transformers (BERT) model is currently one of the most important state-of-the-art models for natural language processing. However, it has also been shown that for domain-specific tasks it is helpful to pretrain BERT on a domain-specific corpus. In this paper, we present TourBERT, a pretrained language model for tourism. We describe how TourBERT was developed and evaluated. The evaluations show that TourBERT outperforms BERT in all tourism-specific tasks.
    Generalization Analysis of Message Passing Neural Networks on Large Random Graphs. (arXiv:2202.00645v4 [cs.LG] UPDATED)
    Message passing neural networks (MPNN) have seen a steep rise in popularity since their introduction as generalizations of convolutional neural networks to graph-structured data, and are now considered state-of-the-art tools for solving a large variety of graph-focused problems. We study the generalization error of MPNNs in graph classification and regression. We assume that graphs of different classes are sampled from different random graph models. We show that, when training an MPNN on a dataset sampled from such a distribution, the generalization gap increases in the complexity of the MPNN, and decreases, not only with respect to the number of training samples, but also with the average number of nodes in the graphs. This shows how an MPNN with high complexity can generalize from a small dataset of graphs, as long as the graphs are large. The generalization bound is derived from a uniform convergence result that shows that any MPNN, applied on a graph, approximates the MPNN applied on the geometric model that the graph discretizes.
    Learning Graph Structure from Convolutional Mixtures. (arXiv:2205.09575v1 [cs.LG])
    Machine learning frameworks such as graph neural networks typically rely on a given, fixed graph to exploit relational inductive biases and thus effectively learn from network data. However, when said graphs are (partially) unobserved, noisy, or dynamic, the problem of inferring graph structure from data becomes relevant. In this paper, we postulate a graph convolutional relationship between the observed and latent graphs, and formulate the graph learning task as a network inverse (deconvolution) problem. In lieu of eigendecomposition-based spectral methods or iterative optimization solutions, we unroll and truncate proximal gradient iterations to arrive at a parameterized neural network architecture that we call a Graph Deconvolution Network (GDN). GDNs can learn a distribution of graphs in a supervised fashion, perform link prediction or edge-weight regression tasks by adapting the loss function, and they are inherently inductive. We corroborate GDN's superior graph recovery performance and its generalization to larger graphs using synthetic data in supervised settings. Furthermore, we demonstrate the robustness and representation power of GDNs on real world neuroimaging and social network datasets.
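    A minimal sketch of the unrolling idea under a strong simplifying assumption (a first-order convolution T = h0*I + h1*S relating the latent graph S to the observed graph T; in a GDN the step size, filter coefficients, and threshold would be learned end-to-end rather than fixed):

```python
import numpy as np

def soft_threshold(X, lam):
    """Proximal operator of the L1 norm: promotes sparse edge weights."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

def unrolled_deconvolution(T, h0=0.1, h1=1.0, step=0.5, lam=0.05, n_iters=10):
    """Truncated ISTA iterations recovering a sparse latent graph S from T."""
    S = np.zeros_like(T)
    for _ in range(n_iters):
        # Gradient of 0.5 * || h0*I + h1*S - T ||^2 with respect to S.
        residual = h0 * np.eye(len(T)) + h1 * S - T
        S = soft_threshold(S - step * h1 * residual, lam)
        S = (S + S.T) / 2                   # keep the graph symmetric
    return S
```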
    Cracking White-box DNN Watermarks via Invariant Neuron Transforms. (arXiv:2205.00199v2 [cs.CR] UPDATED)
    Recently, how to protect the Intellectual Property (IP) of deep neural networks (DNN) has become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activation) of the target model. Sacrificing less functionality and involving more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts and applications in the industry. In this paper, we present the first effective removal attack which cracks almost all the existing white-box watermarking schemes with provably no performance overhead and no required prior knowledge. By analyzing these IP protection mechanisms at the granularity of neurons, we for the first time discover their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On $9$ state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to be almost random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge on the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.
    Understanding Gradient Descent on Edge of Stability in Deep Learning. (arXiv:2205.09745v1 [cs.LG])
    Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an {\em Edge of Stability (EoS)} phase when learning rate (LR) and sharpness (\emph{i.e.}, the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR and loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias either relying on infinitesimal updates or noise in gradient. Formally, for any smooth function $L$ with certain regularity condition, this effect is demonstrated for (1) {\em Normalized GD}, i.e., GD with a varying LR $ \eta_t =\frac{ \eta }{ || \nabla L(x(t)) || } $ and loss $L$; (2) GD with constant LR and loss $\sqrt{L}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{\max}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.
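    A toy 1-D illustration of the Normalized GD update rule analyzed above (not the paper's experiments; the loss and constants are made up): because the step length is constant, the iterate oscillates around the minimum of $L(x) = x^4/4$ instead of converging, while the loss first trends down and then hovers.

```python
import numpy as np

eta, x = 0.1, 2.05
for t in range(121):
    grad = x ** 3                              # dL/dx for L(x) = x^4 / 4
    x -= eta * grad / (abs(grad) + 1e-12)      # normalized GD: unit-length step
    if t % 30 == 0:
        print(f"t={t:3d}  x={x:+.3f}  loss={x**4 / 4:.2e}  "
              f"sharpness={3 * x * x:.2e}")    # L''(x) = 3 x^2
```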
    Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. (arXiv:2205.09546v1 [stat.ML])
    In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for any regularization terms. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as Autoencoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in terms of log-likelihood, sample quality and denoising performance. In a broad sense, the main ambition of this work is to close the gap between the normalizing flow and autoencoder literature under the common framework of invertibility and exact maximum likelihood.
    A Simple Yet Effective SVD-GCN for Directed Graphs. (arXiv:2205.09335v1 [cs.LG])
    In this paper, we propose a simple yet effective graph neural network for directed graphs (digraphs) based on the classic Singular Value Decomposition (SVD), named SVD-GCN. The new graph neural network is built upon the graph SVD-framelet to better decompose graph signals on the SVD ``frequency'' bands. Further, the new framelet SVD-GCN is scaled up to larger graphs using Chebyshev polynomial approximation. Through empirical experiments conducted on several node classification datasets, we have found that SVD-GCN achieves remarkable improvements in a variety of graph node learning tasks and outperforms GCN and many other state-of-the-art graph neural networks for digraphs. Moreover, we empirically demonstrate that SVD-GCN has great denoising capability and robustness to high-level graph data attacks. The theoretical and experimental results prove that SVD-GCN is effective on a variety of graph datasets, while maintaining stable and even better performance than the state-of-the-arts.
    Bayesian Negative Sampling for Recommendation. (arXiv:2204.06520v2 [cs.IR] UPDATED)
    How to sample high quality negative instances from unlabeled data, i.e., negative sampling, is important for training implicit collaborative filtering and contrastive learning models. Although previous studies have proposed some approaches to sample informative instances, few efforts have been made to discriminate false negatives from true negatives for unbiased negative sampling. On the basis of our order-relation analysis of negatives' scores, we first derive the class-conditional density of true negatives and that of false negatives. We next design a Bayesian classifier for negative classification, from which we define a model-agnostic posterior probability estimate of an instance being a true negative as a quantitative negative signal measure. We also propose a Bayesian optimal sampling rule to sample high-quality negatives. The proposed Bayesian Negative Sampling (BNS) algorithm has a linear time complexity. Experimental studies validate the superiority of BNS over its peers in terms of better sampling quality and better recommendation performance.
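    A minimal sketch of the Bayes-classifier idea under an assumed parametric form (the paper derives the class-conditional densities from its order-relation analysis; here they are simply taken to be Gaussians with made-up parameters): compute a posterior probability that an unlabeled item is a true negative and sample negatives according to it.

```python
import numpy as np
from scipy.stats import norm

def p_true_negative(scores, prior_tn=0.9,
                    mu_tn=-1.0, sd_tn=1.0, mu_fn=1.0, sd_fn=1.0):
    """Posterior P(true negative | model score) via Bayes' rule."""
    lik_tn = norm.pdf(scores, mu_tn, sd_tn) * prior_tn
    lik_fn = norm.pdf(scores, mu_fn, sd_fn) * (1 - prior_tn)
    return lik_tn / (lik_tn + lik_fn)

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                 # scores of unlabeled items
post = p_true_negative(scores)
probs = post / post.sum()                      # normalize into a sampling rule
negatives = rng.choice(len(scores), size=64, replace=False, p=probs)
```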
    PSI Draft Specification. (arXiv:2205.09488v1 [cs.SE])
    This document presents the draft specification for delivering machine learning services over HTTP, developed as part of the Protocols and Structures for Inference project, which concluded in 2013. It presents the motivation for providing machine learning as a service, followed by a description of the essential and optional components of such a service.
    Improving VAE based molecular representations for compound property prediction. (arXiv:2201.04929v3 [cs.LG] UPDATED)
    Collecting labeled data for many important tasks in chemoinformatics is time consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules using large scale unlabeled molecular datasets and transfer the knowledge to solve the more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform the transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve the chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors in the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, the correlation between the descriptors and the target properties, the sizes of the datasets, etc. Finally, we show the relation between the performance of property prediction models and the distance between the property prediction dataset and the larger unlabeled dataset in the representation space.
    Morse-STF: Improved Protocols for Privacy-Preserving Machine Learning. (arXiv:2109.11726v2 [cs.CR] UPDATED)
    Secure multi-party computation enables multiple mutually distrusting parties to perform computations on data without revealing the data itself, and has become one of the core technologies behind privacy-preserving machine learning. In this work, we present several improved privacy-preserving protocols for both linear and non-linear layers in machine learning. For linear layers, we present an extended beaver triple protocol for bilinear maps that significantly reduces the communication of convolution layers. For non-linear layers, we introduce novel protocols for computing the sigmoid and softmax functions. Both functions are essential building blocks for machine learning training of classification tasks. Our protocols are both more scalable and more robust than prior constructions, and improve runtime performance by 3-17x. Finally, we introduce Morse-STF, an end-to-end privacy-preserving system for machine learning training that leverages all these improved protocols. Our system achieves a 1.8x speedup on logistic regression and a 3.9-4.9x speedup on convolutional neural networks compared to prior state-of-the-art systems.
    Differential Privacy: What is all the noise about?. (arXiv:2205.09453v1 [cs.CR])
    Differential Privacy (DP) is a formal definition of privacy that provides rigorous guarantees against risks of privacy breaches during data processing. It makes no assumptions about the knowledge or computational power of adversaries, and provides an interpretable, quantifiable and composable formalism. DP has been actively researched during the last 15 years, but it is still hard to master for many Machine Learning (ML) practitioners. This paper aims to provide an overview of the most important ideas, concepts and uses of DP in ML, with special focus on its intersection with Federated Learning (FL).
    An Invariant Matching Property for Distribution Generalization under Intervened Response. (arXiv:2205.09162v1 [stat.ME])
    The task of distribution generalization concerns making reliable predictions of a response in unseen environments. Structural causal models are shown to be useful for modeling distribution changes through intervention. Motivated by the fundamental invariance principle, it is often assumed that the conditional distribution of the response given its predictors remains the same across environments. However, this assumption might be violated in practical settings when the response is intervened. In this work, we investigate a class of models with an intervened response. We identify a novel form of invariance by incorporating the estimates of certain features as additional predictors. Effectively, we show this invariance is equivalent to having a deterministic linear matching that makes the generalization possible. We provide an explicit characterization of the linear matching and present our simulation results under various intervention settings.
    Cross-lingual Transfer of Monolingual Models. (arXiv:2109.07348v2 [cs.CL] UPDATED)
    Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
    Neural ODE Control for Trajectory Approximation of Continuity Equation. (arXiv:2205.09241v1 [math.OC])
    We consider the controllability problem for the continuity equation, corresponding to neural ordinary differential equations (ODEs), which describes how a probability measure is pushed forward by the flow. We show that the controlled continuity equation has very strong controllability properties. Particularly, a given solution of the continuity equation corresponding to a bounded Lipschitz vector field defines a trajectory on the set of probability measures. For this trajectory, we show that there exist piecewise constant training weights for a neural ODE such that the solution of the continuity equation corresponding to the neural ODE is arbitrarily close to it. As a corollary to this result, we establish that the continuity equation of the neural ODE is approximately controllable on the set of compactly supported probability measures that are absolutely continuous with respect to the Lebesgue measure.
    Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks. (arXiv:2205.09117v1 [cs.LG])
    Experience replay plays a crucial role in improving the sample efficiency of deep reinforcement learning agents. Recent advances in experience replay propose using Mixup (Zhang et al., 2018) to further improve sample efficiency via synthetic sample generation. We build upon this technique with Neighborhood Mixup Experience Replay (NMER), a geometrically-grounded replay buffer that interpolates transitions with their closest neighbors in state-action space. NMER preserves a locally linear approximation of the transition manifold by only applying Mixup between transitions with vicinal state-action features. Under NMER, a given transition's set of state action neighbors is dynamic and episode agnostic, in turn encouraging greater policy generalizability via inter-episode interpolation. We combine our approach with recent off-policy deep reinforcement learning algorithms and evaluate on continuous control environments. We observe that NMER improves sample efficiency by an average 94% (TD3) and 29% (SAC) over baseline replay buffers, enabling agents to effectively recombine previous experiences and learn from limited data.
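    A minimal sketch of the interpolation step (the brute-force neighbor search and the hyperparameters are assumptions for illustration): sample a stored transition, find its nearest neighbor in state-action space, and Mixup-interpolate the two transitions.

```python
import numpy as np

def nmer_sample(buffer, alpha=0.4, rng=np.random.default_rng()):
    """buffer: dict of arrays 's', 'a', 'r', 's2' with matching first dim."""
    n = len(buffer["s"])
    i = rng.integers(n)
    sa = np.concatenate([buffer["s"], buffer["a"]], axis=1)
    dists = np.linalg.norm(sa - sa[i], axis=1)
    dists[i] = np.inf
    j = int(np.argmin(dists))          # nearest neighbor in (s, a) space
    lam = rng.beta(alpha, alpha)       # Mixup coefficient
    mix = lambda k: lam * buffer[k][i] + (1 - lam) * buffer[k][j]
    return {k: mix(k) for k in ("s", "a", "r", "s2")}

rng = np.random.default_rng(0)
buffer = {"s": rng.normal(size=(256, 4)), "a": rng.normal(size=(256, 1)),
          "r": rng.normal(size=(256, 1)), "s2": rng.normal(size=(256, 4))}
batch = [nmer_sample(buffer, rng=rng) for _ in range(32)]
```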
    Relational representation learning with spike trains. (arXiv:2205.09140v1 [cs.NE])
    Relational representation learning has lately received an increase in interest due to its flexibility in modeling a variety of systems like interacting particles, materials and industrial projects for, e.g., the design of spacecraft. A prominent class of methods for dealing with relational data is knowledge graph embedding algorithms, where entities and relations of a knowledge graph are mapped to a low-dimensional vector space while preserving its semantic structure. Recently, a graph embedding method has been proposed that maps graph elements to the temporal domain of spiking neural networks. However, it relies on encoding graph elements through populations of neurons that only spike once. Here, we present a model that allows us to learn spike train-based embeddings of knowledge graphs, requiring only one neuron per graph element by fully utilizing the temporal domain of spike patterns. This coding scheme can be implemented with arbitrary spiking neuron models as long as gradients with respect to spike times can be calculated, which we demonstrate for the integrate-and-fire neuron model. In general, the presented results show how relational knowledge can be integrated into spike-based systems, opening up the possibility of merging event-based computing and relational data to build powerful and energy-efficient artificial intelligence applications and reasoning systems.
    Accelerated Training of Physics Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v1 [cs.LG])
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.
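    The key mechanism of DT-PINNs, swapping autodiff spatial derivatives for a precomputed sparse differentiation matrix, can be sketched without the RBF-FD machinery. In the sketch below, a standard 1D finite-difference Laplacian stands in for the paper's RBF-FD weights, and the PDE residual reduces to one sparse matrix-vector product; all names and the manufactured solution are illustrative.

        import numpy as np
        import scipy.sparse as sp

        n = 100
        x = np.linspace(0.0, 1.0, n)
        h = x[1] - x[0]

        # Sparse second-derivative operator (stand-in for RBF-FD weights).
        L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2

        def pde_residual_loss(u_pred, f):
            """||L u - f||^2 on interior nodes: one sparse mat-vec instead of
            higher-order automatic differentiation through the network."""
            r = (L @ u_pred - f)[1:-1]
            return float(np.mean(r**2))

        # Toy check with u = sin(pi x), for which u'' = -pi^2 u.
        u = np.sin(np.pi * x)
        f = -np.pi**2 * u
        print(pde_residual_loss(u, f))   # small discretization error, not exactly 0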
    Dataset Pruning: Reducing Training Data by Examining Generalization Influence. (arXiv:2205.09329v1 [cs.LG])
    The great success of deep learning relies heavily on increasingly larger training data, which comes at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, which is superior to previous score-based sample selection methods.
    GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling. (arXiv:2205.09379v1 [cs.SE])
    GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are unlabeled or inadequately labeled, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification of topics ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered $60\%$ of the most frequent ones for the ranking. GitRanking 1) uses active sampling to ensure a minimal number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes the finding and discovery of their projects more challenging for other users. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations ($\sim$ 15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
    PredictionNet: Real-Time Joint Probabilistic Traffic Prediction for Planning, Control, and Simulation. (arXiv:2109.11094v2 [cs.RO] UPDATED)
    Predicting the future motion of traffic agents is crucial for safe and efficient autonomous driving. To this end, we present PredictionNet, a deep neural network (DNN) that predicts the motion of all surrounding traffic agents together with the ego-vehicle's motion. All predictions are probabilistic and are represented in a simple top-down rasterization that allows an arbitrary number of agents. Conditioned on a multi-layer map with lane information, the network outputs future positions, velocities, and backtrace vectors jointly for all agents including the ego-vehicle in a single pass. Trajectories are then extracted from the output. The network can be used to simulate realistic traffic, and it produces competitive results on popular benchmarks. More importantly, it has been used to successfully control a real-world vehicle for hundreds of kilometers, by combining it with a motion planning/control subsystem. The network runs faster than real-time on an embedded GPU, and the system shows good generalization (across sensory modalities and locations) due to the choice of input representation. Furthermore, we demonstrate that by extending the DNN with reinforcement learning (RL), it can better handle rare or unsafe events like aggressive maneuvers and crashes.
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v4 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows one to express complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow us to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.
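    As a concrete illustration of the recipe (not the authors' library), take ridge regression as the inner problem. The optimality condition $F(w, \lambda) = X^\top(Xw - y) + \lambda w = 0$ together with the implicit function theorem gives $dw/d\lambda = -(X^\top X + \lambda I)^{-1} w$, with no unrolling of the solver:

        import numpy as np

        rng = np.random.default_rng(0)
        X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
        lam = 0.1

        H = X.T @ X + lam * np.eye(5)       # Jacobian of F w.r.t. w
        w = np.linalg.solve(H, X.T @ y)     # inner solve: ridge minimizer
        dw_dlam = -np.linalg.solve(H, w)    # implicit gradient via the IFT

        # Finite-difference check of the implicit gradient (should be tiny).
        eps = 1e-6
        w_eps = np.linalg.solve(X.T @ X + (lam + eps) * np.eye(5), X.T @ y)
        print(np.max(np.abs((w_eps - w) / eps - dw_dlam)))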
    SEMI: Self-supervised Exploration via Multisensory Incongruity. (arXiv:2009.12494v2 [cs.LG] UPDATED)
    Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy by incentivizing the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and it improves sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
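    A toy sketch of the two incongruity signals follows; all shapes, names, and the squared-error form of the alignment term are assumed here for illustration, not taken from the paper's implementation.

        import numpy as np

        def action_incongruity(actions_per_modality):
            """Variance of the policy's actions across sensory inputs:
            rows = modality combinations, columns = action dimensions."""
            return float(np.mean(np.var(actions_per_modality, axis=0)))

        def perception_incongruity(pred_aligned_prob, is_aligned):
            """Error of an alignment predictor on whether inputs are aligned."""
            return float((pred_aligned_prob - is_aligned) ** 2)

        # Policy outputs under three different modality combinations:
        acts = np.array([[0.1, -0.3], [0.4, -0.1], [0.2, 0.0]])
        intrinsic_reward = action_incongruity(acts) + perception_incongruity(0.7, 1.0)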
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v1 [cs.LG])
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with respect to a group $G$, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of "$G$-invariant neural architecture design": What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or "shallow" neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$. The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons. The classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. Based on a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. We draw the network morphisms between the enumerated architectures that can be leveraged during neural architecture search (NAS). Finally, we prove that architectures corresponding to inequivalent cohomology classes in a given cohomology ring coincide in function space only when their weight matrices are zero, and we discuss the implications of this in the context of NAS.
    Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime. (arXiv:1806.04884v3 [stat.ML] UPDATED)
    In this paper, we theoretically prove that deep ReLU neural networks do not get trapped in spurious local minima of the loss landscape in the Neural Tangent Kernel (NTK) regime, that is, under the gradient descent training dynamics of deep ReLU networks whose parameters are initialized from a normal distribution, in the limit as the widths of the hidden layers tend to infinity.
    Hybrid Intelligent Testing in Simulation-Based Verification. (arXiv:2205.09552v1 [cs.AR])
    Efficient and effective testing for simulation-based hardware verification is challenging. Using constrained random test generation, several millions of tests may be required to achieve coverage goals. The vast majority of tests do not contribute to coverage progress, yet they consume verification resources. In this paper, we propose a hybrid intelligent testing approach combining two methods that have previously been treated separately, namely Coverage-Directed Test Selection and Novelty-Driven Verification. Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests. Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli, thereby reducing the number of simulations and increasing testing efficiency. We discuss the strengths and limitations of each method, and we show how our approach addresses each method's limitations, leading to hardware testing that is both efficient and effective.
    Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. (arXiv:2205.09653v1 [stat.ML])
    We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength are preserved across different widths on a CIFAR classification task.
    What Is Fairness? Implications For FairML. (arXiv:2205.09622v1 [cs.LG])
    A growing body of literature in fairness-aware ML (fairML) aspires to mitigate machine learning (ML)-related unfairness in automated decision making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those measures. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the evaluation of ML models in ADM systems. We derive that fairness problems can already arise without the presence of protected attributes, pointing out that fairness and predictive performance are not irreconcilable counterparts, but rather that the latter is necessary to achieve the former. Moreover, we argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes. Eventually, we achieve greater linguistic clarity for the discussion of fairML by clearly assigning responsibilities to stakeholders inside and outside ML.
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v1 [cs.LG])
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
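    For intuition, the standard Fenchel-Young loss with the usual bilinear pairing and negentropy regularization is the familiar logistic loss, whose gradient is available in closed form with no argmax differentiation; the paper's contribution is to replace the bilinear pairing with a general energy function. A minimal numpy sketch of the standard case:

        import numpy as np

        def fy_loss_and_grad(theta, y):
            """Fenchel-Young loss with negentropy regularizer:
            L(theta, y) = logsumexp(theta) - <theta, y>,
            grad = softmax(theta) - y (no argmax differentiation needed)."""
            m = theta.max()
            logZ = m + np.log(np.exp(theta - m).sum())   # stable logsumexp
            p = np.exp(theta - logZ)                     # softmax probabilities
            return logZ - theta @ y, p - y

        theta = np.array([1.0, -0.5, 0.2])
        y = np.array([0.0, 1.0, 0.0])                    # one-hot target
        loss, grad = fy_loss_and_grad(theta, y)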
    Towards a Theory of Faithfulness: Faithful Explanations of Differentiable Classifiers over Continuous Data. (arXiv:2205.09620v1 [cs.LG])
    There is broad agreement in the literature that explanation methods should be faithful to the model that they explain, but faithfulness remains a rather vague term. We revisit faithfulness in the context of continuous data and propose two formal definitions of faithfulness for feature attribution methods. Qualitative faithfulness demands that scores reflect the true qualitative effect (positive vs. negative) of the feature on the model, and quantitative faithfulness that the magnitude of scores reflects the true quantitative effect. We discuss under which conditions, and to what extent (locally vs. globally), these requirements can be satisfied. As an application of the conceptual idea, we look at differentiable classifiers over continuous data and characterize Gradient-scores as follows: every qualitatively faithful feature attribution method is qualitatively equivalent to Gradient-scores. Furthermore, if an attribution method is quantitatively faithful in the sense that changes of the output of the classifier are proportional to the scores of features, then it is either equivalent to gradient-scoring or it is based on an inferior approximation of the classifier. To illustrate the practical relevance of the theory, we experimentally demonstrate that popular attribution methods can fail to give faithful explanations in the setting where the data is continuous and the classifier differentiable.
    Posterior Matching for Arbitrary Conditioning. (arXiv:2201.12414v3 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underlie some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g.~discrete, hierarchical, VaDE).
    Automatic Spoken Language Identification using a Time-Delay Neural Network. (arXiv:2205.09564v1 [cs.CL])
    Closed-set spoken language identification is the task of recognizing the language being spoken in a recorded audio clip from a set of known languages. In this study, a language identification system was built and trained to distinguish between Arabic, Spanish, French, and Turkish based on nothing more than recorded speech. A pre-existing multilingual dataset was used to train a series of acoustic models based on the Tedlium TDNN model to perform automatic speech recognition. The system was provided with a custom multilingual language model and a specialized pronunciation lexicon with language names prepended to phones. The trained model was used to generate phone alignments to test data from all four languages, and languages were predicted based on a voting scheme choosing the most common language prepend in an utterance. Accuracy was measured by comparing predicted languages to known languages, and was determined to be very high in identifying Spanish and Arabic, and somewhat lower in identifying Turkish and French.
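    The voting scheme lends itself to a compact sketch; the phone-alignment format below (a language code prepended to each phone with an underscore) is assumed for illustration.

        from collections import Counter

        def predict_language(aligned_phones):
            """Majority vote over language prefixes of decoded phones,
            e.g. ['es_a', 'es_m', 'fr_o', 'es_s'] -> 'es'."""
            votes = Counter(p.split('_', 1)[0] for p in aligned_phones)
            return votes.most_common(1)[0][0]

        print(predict_language(['es_a', 'es_m', 'fr_o', 'es_s']))   # 'es'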
    Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search. (arXiv:2205.09676v1 [cs.CV])
    Existing trackers usually select a location or proposal with the maximum score as the tracking result for each frame. However, such a greedy search scheme may not be the optimal choice, especially when encountering challenging tracking scenarios like heavy occlusion and fast motion, since accumulated errors can make response scores unreliable. In this paper, we propose a novel multi-agent reinforcement learning based beam search strategy (termed BeamTracking) to address this issue. Specifically, we formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. We take the target feature, proposal feature, and its response score as the state, and also consider actions predicted by nearby agents, to train the multi-agents to select their actions. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validate the effectiveness of the proposed algorithm.
    scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data. (arXiv:2205.09523v1 [stat.ML])
    Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated from these technologies tend to have high levels of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.
    The First Optimal Acceleration of High-Order Methods in Smooth Convex Optimization. (arXiv:2205.09647v1 [math.OC])
    In this paper, we study the fundamental open question of finding the optimal high-order algorithm for solving smooth convex minimization problems. Arjevani et al. (2019) established the lower bound $\Omega\left(\epsilon^{-2/(3p+1)}\right)$ on the number of the $p$-th order oracle calls required by an algorithm to find an $\epsilon$-accurate solution to the problem, where the $p$-th order oracle stands for the computation of the objective function value and the derivatives up to the order $p$. However, the existing state-of-the-art high-order methods of Gasnikov et al. (2019b); Bubeck et al. (2019); Jiang et al. (2019) achieve the oracle complexity $\mathcal{O}\left(\epsilon^{-2/(3p+1)} \log (1/\epsilon)\right)$, which does not match the lower bound. The reason for this is that these algorithms require performing a complex binary search procedure, which makes them neither optimal nor practical. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\epsilon^{-2/(3p+1)}\right)$ $p$-th order oracle complexity.
    Machine learning applications for noisy intermediate-scale quantum computers. (arXiv:2205.09414v1 [quant-ph])
    Quantum machine learning has proven to be a fruitful area in which to search for potential applications of quantum computers. This is particularly true for those available in the near term, so called noisy intermediate-scale quantum (NISQ) devices. In this Thesis, we develop and study three quantum machine learning applications suitable for NISQ computers, ordered in terms of increasing complexity of data presented to them. These algorithms are variational in nature and use parameterised quantum circuits (PQCs) as the underlying quantum machine learning model. The first application area is quantum classification using PQCs, where the data is classical feature vectors and their corresponding labels. Here, we study the robustness of certain data encoding strategies in such models against noise present in a quantum computer. The second area is generative modelling using quantum computers, where we use quantum circuit Born machines to learn and sample from complex probability distributions. We discuss and present a framework for quantum advantage for such models, propose gradient-based training methods and demonstrate these both numerically and on the Rigetti quantum computer up to 28 qubits. For our final application, we propose a variational algorithm in the area of approximate quantum cloning, where the data becomes quantum in nature. For the algorithm, we derive differentiable cost functions, prove theoretical guarantees such as faithfulness, and incorporate state of the art methods such as quantum architecture search. Furthermore, we demonstrate how this algorithm is useful in discovering novel implementable attacks on quantum cryptographic protocols, focusing on quantum coin flipping and key distribution as examples.
    CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network. (arXiv:2205.09612v1 [cs.LG])
    In this paper, we propose a Classification Confidence Network (CLCNet) that can determine whether a classification model classifies input samples correctly. It takes a classification result in the form of a vector of any dimension and returns a confidence score as output, which represents the probability of an instance being classified correctly. We can utilize CLCNet in a simple cascade structure system consisting of several SOTA (state-of-the-art) classification models, and our experiments show that the system can achieve the following advantages: 1. The system can customize the average computation requirement (FLOPs) per image during inference. 2. Under the same computation requirement, the performance of the system can exceed that of any single model that has a structure identical to the models in the system but a different size. In fact, this is a new type of ensemble modeling. Like general ensemble modeling, it can achieve higher performance than a single classification model, yet our system requires much less computation than general ensemble modeling. We have uploaded our code to a github repository: https://github.com/yaoching0/CLCNet-Rethinking-of-Ensemble-Modeling.
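    The cascade can be sketched as follows, with CLCNet abstracted as any callable that maps a classification vector to a confidence score; the models and threshold below are toy stand-ins, not the paper's setup.

        import numpy as np

        def cascade_predict(x, small_model, large_model, confidence_fn, tau=0.9):
            logits = small_model(x)
            if confidence_fn(logits) >= tau:      # confident: stop early, save FLOPs
                return logits.argmax()
            return large_model(x).argmax()        # otherwise pay for the big model

        # Toy stand-ins: "models" return logits; confidence = max softmax prob.
        softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
        conf = lambda z: softmax(z).max()
        small = lambda x: np.array([2.0, 0.1, -1.0])
        large = lambda x: np.array([0.5, 3.0, -0.2])
        print(cascade_predict(None, small, large, conf, tau=0.8))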
    Practical Skills Demand Forecasting via Representation Learning of Temporal Dynamics. (arXiv:2205.09508v1 [econ.GN])
    Rapid technological innovation threatens to leave much of the global workforce behind. Today's economy juxtaposes white-hot demand for skilled labor against stagnant employment prospects for workers unprepared to participate in a digital economy. It is a moment of peril and opportunity for every country, with outcomes measured in long-term capital allocation and the life satisfaction of billions of workers. To meet the moment, governments and markets must find ways to quicken the rate at which the supply of skills reacts to changes in demand. More fully and quickly understanding labor market intelligence is one route. In this work, we explore the utility of time series forecasts to enhance the value of skill demand data gathered from online job advertisements. This paper presents a pipeline which makes one-shot multi-step forecasts into the future using a decade of monthly skill demand observations based on a set of recurrent neural network methods. We compare the performance of a multivariate model versus a univariate one, analyze how correlation between skills can influence multivariate model results, and present predictions of demand for a selection of skills practiced by workers in the information technology industry.
    Gold-standard solutions to the Schr\"odinger equation using deep learning: How much physics do we need?. (arXiv:2205.09438v1 [cs.LG])
    Finding accurate solutions to the Schr\"odinger equation is the key unsolved challenge of computational chemistry. Given its importance for the development of new chemical compounds, decades of research have been dedicated to this problem, but due to the large dimensionality even the best available methods do not yet reach the desired accuracy. Recently the combination of deep learning with Monte Carlo methods has emerged as a promising way to obtain highly accurate energies and moderate scaling of computational cost. In this paper we significantly contribute towards this goal by introducing a novel deep-learning architecture that achieves 40-70% lower energy error at 8x lower computational cost compared to previous approaches. Using our method we establish a new benchmark by calculating the most accurate variational ground state energies ever published for a number of different atoms and molecules. We systematically break down and measure our improvements, focusing in particular on the effect of increasing physical prior knowledge. We surprisingly find that increasing the prior knowledge given to the architecture can actually decrease accuracy.
    Spatial Autoregressive Coding for Graph Neural Recommendation. (arXiv:2205.09489v1 [cs.IR])
    Graph embedding methods including traditional shallow models and deep Graph Neural Networks (GNNs) have led to promising applications in recommendation. Nevertheless, shallow models, especially random-walk-based algorithms, fail to adequately exploit neighbor proximity in sampled subgraphs or sequences due to their optimization paradigm. GNN-based algorithms suffer from insufficient utilization of high-order information and easily cause over-smoothing problems when stacking too many layers, which may deteriorate the recommendations of low-degree (long-tail) items, limiting expressiveness and scalability. In this paper, we propose a novel framework SAC, namely Spatial Autoregressive Coding, to solve the above problems in a unified way. To adequately leverage neighbor proximity and high-order information, we design a novel spatial autoregressive paradigm. Specifically, we first randomly mask multi-hop neighbors and embed the target node by integrating all other surrounding neighbors with an explicit multi-hop attention. Then we reinforce the model to learn a neighbor-predictive coding for the target node by contrasting the coding and the masked neighbors' embedding, equipped with a new hard negative sampling strategy. To learn the minimal sufficient representation for the target-to-neighbor prediction task and remove the redundancy of neighbors, we devise Neighbor Information Bottleneck by maximizing the mutual information between target predictive coding and the masked neighbors' embedding, and simultaneously constraining those between the coding and surrounding neighbors' embedding. Experimental results on both public recommendation datasets and a real scenario web-scale dataset Douyin-Friend-Recommendation demonstrate the superiority of SAC compared with state-of-the-art methods.
    Learning-based AC-OPF Solvers on Realistic Network and Realistic Loads. (arXiv:2205.09452v1 [cs.LG])
    Deep learning approaches for the Alternating Current-Optimal Power Flow (AC-OPF) problem have been under active research in recent years. A common shortcoming in this area of research is the lack of a dataset that includes both a realistic power network topology and the corresponding realistic loads. To address this issue, we construct an AC-OPF formulation-ready dataset called TAS-97 that contains realistic network information and realistic bus loads from Tasmania's electricity network. We found that the realistic loads in Tasmania are correlated between buses and show signs of an underlying multivariate normal distribution. Feasibility-optimized end-to-end deep neural network models are trained and tested on the constructed dataset. Trained on samples with bus loads generated from a fitted multivariate normal distribution, our learning-based AC-OPF solver achieves a 0.13% cost optimality gap, a 99.73% feasibility rate, and a 38.62x speedup on realistic testing samples when compared to PYPOWER.
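    The load-generation step described above amounts to fitting and sampling a multivariate normal; a minimal sketch with stand-in historical data (97 buses, matching the TAS-97 name) follows.

        import numpy as np

        rng = np.random.default_rng(0)
        historical_loads = rng.gamma(2.0, 10.0, size=(1000, 97))  # stand-in data

        mu = historical_loads.mean(axis=0)
        cov = np.cov(historical_loads, rowvar=False)              # bus correlations
        synthetic_loads = rng.multivariate_normal(mu, cov, size=5000)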
    Why only Micro-F1? Class Weighting of Measures for Relation Classification. (arXiv:2205.09460v1 [cs.CL])
    Relation classification models are conventionally evaluated using only a single measure, e.g., micro-F1, macro-F1 or AUC. In this work, we analyze weighting schemes, such as micro and macro, for imbalanced datasets. We introduce a framework for weighting schemes, where existing schemes are extremes, and two new intermediate schemes. We show that reporting results of different weighting schemes better highlights strengths and weaknesses of a model.
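    The two extreme schemes are directly available in scikit-learn; on an imbalanced toy problem they already tell different stories, and the paper's intermediate schemes interpolate between them.

        from sklearn.metrics import f1_score

        y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]      # imbalanced 3-class toy labels
        y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 2]

        print(f1_score(y_true, y_pred, average='micro'))  # per-sample aggregation
        print(f1_score(y_true, y_pred, average='macro'))  # unweighted per-class mean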
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v1 [stat.ML])
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply only to pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.
    Mobility, Communication and Computation Aware Federated Learning for Internet of Vehicles. (arXiv:2205.09529v1 [cs.LG])
    While privacy concerns entice connected and automated vehicles to incorporate on-board federated learning (FL) solutions, an integrated learning platform that is aware of vehicle-to-everything communication and heterogeneous computation power is urgently needed to make this a reality. Motivated by this, we propose a novel mobility, communication and computation aware online FL platform that uses on-road vehicles as learning agents. Thanks to the advanced features of modern vehicles, the on-board sensors can collect data as vehicles travel along their trajectories, while the on-board processors can train machine learning models using the collected data. To take the high mobility of vehicles into account, we consider the delay as a learning parameter and restrict it to be less than a tolerable threshold. To satisfy this threshold, the central server accepts partially trained models, the distributed roadside units (a) perform downlink multicast beamforming to minimize global model distribution delay and (b) allocate optimal uplink radio resources to minimize local model offloading delay, and the vehicle agents conduct heterogeneous local model training. Using real-world vehicle trace datasets, we validate our FL solutions. Simulations show that the proposed integrated FL platform is robust and outperforms baseline models. With reasonable local training episodes, it can effectively satisfy all constraints and deliver near ground truth multi-horizon velocity and vehicle-specific power predictions.
    Learning Multiscale Convolutional Dictionaries for Image Reconstruction. (arXiv:2011.12815v3 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have been tremendously successful in solving imaging inverse problems. To understand their success, an effective strategy is to construct simpler and mathematically more tractable convolutional sparse coding (CSC) models that share essential ingredients with CNNs. Existing CSC methods, however, underperform leading CNNs in challenging inverse problems. We hypothesize that the performance gap may be attributed in part to how they process images at different spatial scales: While many CNNs use multiscale feature representations, existing CSC models mostly rely on single-scale dictionaries. To close the performance gap, we thus propose a multiscale convolutional dictionary structure. The proposed dictionary structure is derived from the U-Net, arguably the most versatile and widely used CNN for image-to-image learning problems. We show that incorporating the proposed multiscale dictionary in an otherwise standard CSC framework yields performance competitive with state-of-the-art CNNs across a range of challenging inverse problems including CT and MRI reconstruction. Our work thus demonstrates the effectiveness and scalability of the multiscale CSC approach in solving challenging inverse problems.
    ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution. (arXiv:2205.09548v1 [q-bio.BM])
    Directed evolution is a versatile technique in protein engineering that mimics the process of natural selection by iteratively alternating between mutagenesis and screening in order to search for sequences that optimize a given property of interest, such as catalytic activity and binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce in the vast sequence space. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry and biological pathways. Despite the great potential of these ML methods, they encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting a high-dimensional feature representation for protein sequences and inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which employs a combination of a novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental and time costs of directed evolution, and it can be further generalized as a powerful tool for adaptive experimental design in a broader context.
    An Introduction to Quantum Machine Learning for Engineers. (arXiv:2205.09510v1 [quant-ph])
    In the current noisy intermediate-scale quantum (NISQ) era, quantum machine learning is emerging as a dominant paradigm to program gate-based quantum computers. In quantum machine learning, the gates of a quantum circuit are parametrized, and the parameters are tuned via classical optimization based on data and on measurements of the outputs of the circuit. Parametrized quantum circuits (PQCs) can efficiently address combinatorial optimization problems, implement probabilistic generative models, and carry out inference (classification and regression). This monograph provides a self-contained introduction to quantum machine learning for an audience of engineers with a background in probability and linear algebra. It first describes the background, concepts, and tools necessary to describe quantum operations and measurements. Then, it covers parametrized quantum circuits, the variational quantum eigensolver, as well as unsupervised and supervised quantum machine learning formulations.
    Neural ODEs with Irregular and Noisy Data. (arXiv:2205.09479v1 [cs.LG])
    Measurement noise is an integral part of collecting data from a physical process. Thus, noise removal is necessary to draw conclusions from these data, and it often becomes essential for constructing dynamical models from them. We discuss a methodology to learn differential equation(s) from noisy and irregularly sampled measurements. In our methodology, the main innovation is the integration of deep neural networks with the neural ordinary differential equations (ODEs) approach. Precisely, we aim at learning a neural network that provides (approximately) an implicit representation of the data and an additional neural network that models the vector fields of the dependent variables. We combine these two networks by constraining them via neural ODEs. The proposed framework to learn a model describing the vector field is highly effective under noisy measurements. The approach can handle scenarios where dependent variables are not available on the same temporal grid. Moreover, a particular structure, e.g., second-order with respect to time, can easily be incorporated. We demonstrate the effectiveness of the proposed method for learning models using data obtained from various differential equations and present a comparison with the neural ODE method that makes no special treatment of noise.
    Variational Inference for Bayesian Bridge Regression. (arXiv:2205.09515v1 [stat.ML])
    We study the implementation of Automatic Differentiation Variational Inference (ADVI) for Bayesian inference on regression models with bridge penalization. The bridge approach uses the $\ell_{\alpha}$ norm, with $\alpha \in (0, +\infty)$, to define a penalization on large values of the regression coefficients, which includes the Lasso ($\alpha = 1$) and ridge ($\alpha = 2$) penalizations as special cases. Full Bayesian inference seamlessly provides joint uncertainty estimates for all model parameters. Although MCMC approaches are available for bridge regression, they can be slow for large datasets, especially in high dimensions. The ADVI implementation allows the use of small batches of data at each iteration (due to stochastic gradient based algorithms), therefore speeding up computational time in comparison with MCMC. We illustrate the approach on non-parametric regression models with B-splines, although the method works seamlessly for other choices of basis functions. A simulation study shows the main properties of the proposed method.
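    The bridge penalty family is simple to state; a minimal sketch showing the lasso and ridge special cases:

        import numpy as np

        def bridge_penalty(beta, lam, alpha):
            """lam * sum_j |beta_j|^alpha, with alpha in (0, inf)."""
            return lam * np.sum(np.abs(beta) ** alpha)

        beta = np.array([0.5, -2.0, 1.5])
        print(bridge_penalty(beta, lam=0.1, alpha=1.0))   # lasso
        print(bridge_penalty(beta, lam=0.1, alpha=2.0))   # ridge
        print(bridge_penalty(beta, lam=0.1, alpha=0.5))   # sparser-than-lasso bridge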
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v1 [cs.LG])
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using backpropagation. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v1 [cs.LG])
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyperparameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyperparameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, a numerical example is provided to explore the advantages of the super approximation power of ReLU NestNets.
    Torchhd: An Open-Source Python Library to Support Hyperdimensional Computing Research. (arXiv:2205.09208v1 [cs.LG])
    Hyperdimensional Computing (HDC) is a neuro-inspired computing framework that exploits high-dimensional random vector spaces. HDC uses extremely parallelizable arithmetic to provide computational solutions that balance accuracy, efficiency and robustness. This has proven especially useful in resource-limited scenarios such as embedded systems. The commitment of the scientific community to aggregate and disseminate research in this particularly multidisciplinary field has been fundamental for its advancement. Adding to this effort, we propose Torchhd, a high-performance open-source Python library for HDC. Torchhd seeks to make HDC more accessible and serves as an efficient foundation for research and application development. The easy-to-use library builds on top of PyTorch and features state-of-the-art HDC functionality, clear documentation and implementation examples from notable publications. Comparing publicly available code with their Torchhd implementation shows that experiments can run up to 104$\times$ faster. Torchhd is available at: https://github.com/hyperdimensional-computing/torchhd
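    Rather than reproducing Torchhd's API here, the following is a library-agnostic numpy sketch of the operations it packages: random bipolar hypervectors, binding, bundling, and similarity (see the repository above for the real interface).

        import numpy as np

        rng = np.random.default_rng(0)
        D = 10_000                                          # hypervector dimensionality

        def random_hv():
            return rng.choice([-1, 1], size=D)

        bind = lambda a, b: a * b                           # associates two concepts
        bundle = lambda *hvs: np.sign(np.sum(hvs, axis=0))  # superposes several concepts
        cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

        country, currency = random_hv(), random_hv()
        usa, dollar = random_hv(), random_hv()
        record = bundle(bind(country, usa), bind(currency, dollar))

        # Unbinding with `country` recovers something similar to `usa`.
        print(cosine(bind(record, country), usa))           # noticeably > 0
        print(cosine(bind(record, country), dollar))        # ~ 0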
    TransTab: Learning Transferable Tabular Transformers Across Tables. (arXiv:2205.09328v1 [cs.LG])
    Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume the table structure stays fixed in training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns. This preprocessing often incurs significant data waste (e.g., removing unmatched columns and samples). How to learn ML models from multiple tables with partially overlapping columns? How to incrementally update ML models as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How to train an ML model which can predict on an unseen table? To answer all those questions, we propose to relax fixed table structures by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) to a generalizable embedding vector, and then apply stacked transformers for feature encoding. One methodology insight is combining column description and table cells as the raw input to a gated transformer model. The other insight is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, 1.78 out of 12 methods in supervised learning, feature incremental learning, and transfer learning scenarios, respectively; and the proposed pretraining leads to a 2.3\% AUC lift on average over supervised learning.
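    The first insight, serializing column names together with cell values so that rows from tables with different schemas share one input space, can be sketched in a few lines; TransTab's actual tokenizer and gated transformer are more involved, and the template below is assumed.

        def row_to_text(row: dict) -> str:
            """Serialize one table row into text a transformer can embed."""
            return " ; ".join(f"{col} is {val}" for col, val in row.items())

        print(row_to_text({"age": 63, "tumor stage": "II", "smoker": "no"}))
        # -> "age is 63 ; tumor stage is II ; smoker is no"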
    ESCADA: Efficient Safety and Context Aware Dose Allocation for Precision Medicine. (arXiv:2111.13415v2 [cs.LG] UPDATED)
    Finding an optimal individualized treatment regimen is considered one of the most challenging precision medicine problems. Various patient characteristics influence the response to the treatment, and hence, there is no one-size-fits-all regimen. Moreover, the administration of an unsafe dose during the treatment can have adverse effects on health. Therefore, a treatment model must ensure patient \emph{safety} while \emph{efficiently} optimizing the course of therapy. We study a prevalent medical problem where the treatment aims to keep a physiological variable in a safe range and preferably close to a target level, which we refer to as \emph{leveling}. Such a task may be relevant in numerous other domains as well. We propose ESCADA, a novel and generic multi-armed bandit (MAB) algorithm tailored for the leveling task, to make safe, personalized, and context-aware dose recommendations. We derive high probability upper bounds on its cumulative regret and safety guarantees. Following ESCADA's design, we also describe its Thompson sampling-based counterpart. We discuss why the straightforward adaptations of the classical MAB algorithms such as GP-UCB may not be a good fit for the leveling task. Finally, we make \emph{in silico} experiments on the bolus-insulin dose allocation problem in type-1 diabetes mellitus disease and compare our algorithms against the famous GP-UCB algorithm, the rule-based dose calculators, and a clinician.
    Deep Fusion Prior for Multi-Focus Image Super Resolution Fusion. (arXiv:2110.05706v3 [cs.CV] UPDATED)
    This paper unifies the multi-focus image fusion (MFIF) and blind super resolution (SR) problems as the multi-focus image super resolution fusion (MFISRF) task, and proposes a novel unified dataset-free unsupervised framework named deep fusion prior (DFP) to address this MFISRF task. DFP consists of the SKIPnet network, the DoubleReblur focus measurement tactic, a decision embedding module and loss functions. In particular, DFP can obtain MFISRF from only two low-resolution inputs without any external dataset; SKIPnet, implementing unsupervised learning via deep image prior, is an end-to-end generative network acting as the engine of DFP; DoubleReblur is used to determine the primary decision map without learning, based on estimated PSF and Gaussian kernel convolutions; the decision embedding module optimizes the decision map via learning; and the DFP losses, composed of content loss, joint gradient loss and gradient limit loss, can obtain high-quality MFISRF results robustly. Experiments show that our proposed DFP matches and even outperforms state-of-the-art MFIF and SR method combinations. Additionally, DFP is a general framework, thus its networks and focus measurement tactics can be continuously updated to further improve MFISRF performance. DFP codes are open source and will be available soon at this http URL.
    CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm. (arXiv:2103.00858v4 [cs.DB] UPDATED)
    Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, making them far from practical indexes. In this paper, we identify that ignoring the importance of data partitioning during model training is the main reason for this problem. Thus, we explicitly apply data partitioning to index construction and propose a new efficient and updatable cache-aware RMI framework, called CARMI. Specifically, we introduce entropy as a metric to quantify and characterize the effectiveness of data partitioning of tree nodes in learned indexes and propose a novel cost model, laying a new theoretical foundation for future research. Then, based on our novel cost model, CARMI can automatically determine tree structures and model types under various datasets and workloads by a hybrid construction algorithm without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, a new cache-aware design is also applied in CARMI, which makes full use of the characteristics of the CPU cache to effectively reduce the number of memory accesses. Our experimental study shows that CARMI performs better than baselines, achieving an average of 2.2x/1.9x speedup compared to B+ Tree/ALEX, while using only about 0.77x memory space of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2x over the nearest competitor RMI, which has been carefully tuned for each dataset in advance.
    Denoising Noisy Neural Networks: A Bayesian Approach with Compensation. (arXiv:2105.10699v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs) with noisy weights, which we refer to as noisy neural networks (NoisyNNs), arise from the training and inference of DNNs in the presence of noise. NoisyNNs emerge in many new applications, including the wireless transmission of DNNs, the efficient deployment or storage of DNNs in analog devices, and the truncation or quantization of DNN weights. This paper studies a fundamental problem of NoisyNNs: how to reconstruct the DNN weights from their noisy manifestations. While all prior works relied on maximum likelihood (ML) estimation, this paper puts forth a denoising approach to reconstruct DNNs with the aim of maximizing the inference accuracy of the reconstructed models. The superiority of our denoiser is rigorously proven in two small-scale problems, wherein we consider a quadratic neural network function and a shallow feedforward neural network, respectively. When applied to advanced learning tasks with modern DNN architectures, our denoiser exhibits significantly better performance than the ML estimator. Consider the average test accuracy of the denoised DNN model as a function of the weight variance to noise power ratio (WNR). When denoising a noisy ResNet34 model arising from noisy inference, our denoiser outperforms ML estimation by up to 4.1 dB to achieve a test accuracy of 60%. When denoising a noisy ResNet18 model arising from noisy training, our denoiser outperforms ML estimation by 13.4 dB and 8.3 dB to achieve test accuracies of 60% and 80%, respectively.
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v2 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, on both horizontally and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy-preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example owing to its versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant, for example when performing analyses that require counting the number of samples that fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees, identify any concerns, and compare the proposed method to the original solution in this respect.  ( 2 min )
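    As a purely illustrative aside, the blinding identity that makes a recursive reduction to smaller scalar products possible can be sketched as follows. This shows only the arithmetic, not the actual protocol (which must never reveal the correction term in the clear), and the function name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def blinded_dot(a, b):
    # Party A blinds its vector with randomness r before it leaves A.
    r = rng.integers(-100, 100, size=a.shape)
    blinded = a + r
    # (a + r) . b = a . b + r . b, so the original dot product reduces
    # to scalar products involving the blinding vector -- the kind of
    # smaller scalar products resolved recursively in the paper. A real
    # protocol computes the correction r . b obliviously.
    correction = r @ b
    return blinded @ b - correction

a, b = rng.integers(0, 10, size=8), rng.integers(0, 10, size=8)
assert blinded_dot(a, b) == a @ b
```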
    Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients. (arXiv:2109.11788v3 [cs.LG] UPDATED)
    Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. By sampling the weights of a linear combination of two approximate critics from a highly shrunk estimation bias interval, our Q-value update rule is not affected by the variance of the rewards received by the agents throughout learning. We test the performance of the introduced improvement on a set of MuJoCo and Box2D continuous control tasks and demonstrate that it considerably outperforms the existing approaches and improves the state-of-the-art by a significant margin.  ( 2 min )
    CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning. (arXiv:2112.04564v3 [cs.CV] UPDATED)
    In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced SSL. To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our code will be made publicly available.  ( 2 min )
    M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation. (arXiv:2112.07574v2 [cs.LG] UPDATED)
    This work proposes M3E2, a multi-task learning neural network model to estimate the effects of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatments applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines on three synthetic benchmark datasets: two with multiple treatments and one with a single treatment. Our analysis shows that our method has superior performance, yielding more accurate estimates of the multiple treatment effects.  ( 2 min )
    DBSegment: Fast and robust segmentation of deep brain structures -- Evaluation of transportability across acquisition domains. (arXiv:2110.09473v3 [eess.IV] UPDATED)
    Segmenting deep brain structures from magnetic resonance images is important for patient diagnosis, surgical planning, and research. Most current state-of-the-art solutions follow a segmentation-by-registration approach, where subject MRIs are mapped to a template with well-defined segmentations. However, registration-based pipelines are time-consuming, thus limiting their clinical use. This paper uses deep learning to provide a robust and efficient deep brain segmentation solution. The method consists of a pre-processing step to conform all MRI images to the same orientation, followed by a convolutional neural network using the nnU-Net framework. We use a total of 14 datasets from both research and clinical collections. Of these, seven were used for training and validation and seven were retained for independent testing. We trained the network to segment 30 deep brain structures, as well as a brain mask, using labels generated from a registration-based approach. We evaluated the generalizability of the network by performing a leave-one-dataset-out cross-validation, and extensive testing on external datasets. Furthermore, we assessed cross-domain transportability by evaluating the results separately on different domains. We achieved an average DSC of 0.89 $\pm$ 0.04 on the independent testing datasets when compared to the registration-based gold standard. On our test system, the computation time decreased from 42 minutes for a reference registration-based pipeline to 1 minute. Our proposed method is fast, robust, and generalizes with high reliability. It can be extended to the segmentation of other brain structures. The method is publicly available on GitHub, as well as a pip package for convenient usage.  ( 3 min )
    Foundation Posteriors for Approximate Probabilistic Inference. (arXiv:2205.09735v1 [cs.LG])
    Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a ``foundation'' posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.  ( 2 min )
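    To make the masking construction concrete, here is a minimal sketch; the dict-based representation of a program's sampled variables is an assumption for illustration (the paper works with STAN programs):

```python
import random

def masked_training_example(program_sample, mask_frac=0.3):
    """Turn one forward sample of a probabilistic program into a
    masked-language-modeling example: hide a random subset of the
    variable assignments and keep them as prediction targets.
    """
    names = list(program_sample)
    n_masked = max(1, int(mask_frac * len(names)))
    masked = set(random.sample(names, n_masked))
    inputs = {v: ("<MASK>" if v in masked else x)
              for v, x in program_sample.items()}
    targets = {v: program_sample[v] for v in masked}
    return inputs, targets

# One (input, target) pair for training the unmasking network:
inputs, targets = masked_training_example({"mu": 0.3, "sigma": 1.2, "y": 0.7})
```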
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v1 [cs.LG])
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks, including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretability compared to existing models.
    Deep Learning in Business Analytics: A Clash of Expectations and Reality. (arXiv:2205.09337v1 [cs.LG])
    Our fast-paced digital economy shaped by global competition requires increased data-driven decision-making based on artificial intelligence (AI) and machine learning (ML). The benefits of deep learning (DL) are manifold, but it comes with limitations that have, so far, interfered with widespread industry adoption. This paper explains why DL, despite its popularity, has difficulties speeding up its adoption within business analytics. It is shown, by a mixture of content analysis and empirical study, that the adoption of deep learning is affected not only by computational complexity, the lack of a big data architecture, the lack of transparency (black-box behavior), and a skill shortage, but also by the fact that DL does not outperform traditional ML models in the case of structured datasets with fixed-length feature vectors. Deep learning should be regarded as a powerful addition to the existing body of ML models instead of a one-size-fits-all solution.
    Darts: User-Friendly Modern Machine Learning for Time Series. (arXiv:2110.03224v3 [cs.LG] UPDATED)
    We present Darts, a Python machine learning library for time series, with a focus on forecasting. Darts offers a variety of models, from classics such as ARIMA to state-of-the-art deep neural networks. The emphasis of the library is on offering modern machine learning functionalities, such as supporting multidimensional series, meta-learning on multiple series, training on large datasets, incorporating external data, ensembling models, and providing a rich support for probabilistic forecasting. At the same time, great care goes into the API design to make it user-friendly and easy to use. For instance, all models can be used using fit()/predict(), similar to scikit-learn.  ( 2 min )
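    For example, the advertised fit()/predict() surface looks like the following sketch, which mirrors the library's own quickstart (the CSV path and column names are assumptions):

```python
import pandas as pd
from darts import TimeSeries
from darts.models import ExponentialSmoothing

# Build a univariate TimeSeries from a DataFrame, hold out 36 points,
# fit a classic model, and forecast -- the same surface as scikit-learn.
df = pd.read_csv("AirPassengers.csv")
series = TimeSeries.from_dataframe(df, "Month", "#Passengers")
train, val = series[:-36], series[-36:]

model = ExponentialSmoothing()
model.fit(train)
forecast = model.predict(36)
print(forecast.values().shape)  # (36, 1)
```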
    Bypassing Logits Bias in Online Class-Incremental Learning with a Generative Framework. (arXiv:2205.09347v1 [cs.LG])
    Continual learning requires the model to maintain the learned knowledge while learning from a non-i.i.d data stream continually. Due to the single-pass training setting, online continual learning is very challenging, but it is closer to real-world scenarios where quick adaptation to new data is appealing. In this paper, we focus on the online class-incremental learning setting in which new classes emerge over time. Almost all existing methods are replay-based with a softmax classifier. However, the inherent logits bias problem in the softmax classifier is a main cause of catastrophic forgetting, while existing solutions are not applicable to online settings. To bypass this problem, we abandon the softmax classifier and propose a novel generative framework based on the feature space. In our framework, a generative classifier which utilizes replay memory is used for inference, and the training objective is a pair-based metric learning loss which is proven theoretically to optimize the feature space in a generative way. In order to improve the ability to learn new data, we further propose a hybrid of generative and discriminative loss to train the model. Extensive experiments on several benchmarks, including newly introduced task-free datasets, show that our method beats a series of state-of-the-art replay-based methods with discriminative classifiers, and consistently reduces catastrophic forgetting by a remarkable margin.
    IL-flOw: Imitation Learning from Observation using Normalizing Flows. (arXiv:2205.09251v1 [cs.LG])
    We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only. Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods which require updating the reward model during policy search and are known to be unstable and difficult to optimize. Our method, IL-flOw, recovers the expert policy by modelling state-state transitions, by generating rewards using deep density estimators trained on the demonstration trajectories, avoiding the instability issues of adversarial methods. We demonstrate that using the state transition log-probability density as a reward signal for forward reinforcement learning translates to matching the trajectory distribution of the expert demonstrations, and experimentally show good recovery of the true reward signal as well as state of the art results for imitation from observation on locomotion and robotic continuous control tasks.  ( 2 min )
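    A minimal sketch of the reward construction, assuming `flow` is any density model over concatenated state pairs that exposes log_prob() (the paper's specific flow architecture is not reproduced here):

```python
import torch

def flow_rewards(flow, states):
    """Score each transition (s_t, s_{t+1}) by its log-density under a
    normalizing flow fitted to expert state transitions; higher density
    under the expert data means higher reward for the learner.
    """
    pairs = torch.cat([states[:-1], states[1:]], dim=-1)  # (T-1, 2*state_dim)
    return flow.log_prob(pairs)

# The resulting rewards then drive any standard forward RL algorithm,
# which the abstract argues matches the expert's trajectory distribution.
```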
    Continual Pre-Training Mitigates Forgetting in Language and Vision. (arXiv:2205.09357v1 [cs.LG])
    Pre-trained models are nowadays a fundamental component of machine learning research. In continual learning, they are commonly used to initialize the model before training on the stream of non-stationary data. However, pre-training is rarely applied during continual learning. We formalize and investigate the characteristics of the continual pre-training scenario in both language and vision environments, where a model is continually pre-trained on a stream of incoming data and only later fine-tuned to different downstream tasks. We show that continually pre-trained models are robust against catastrophic forgetting and we provide strong empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols. Code is provided at https://github.com/AndreaCossu/continual-pretraining-nlp-vision .  ( 2 min )
    Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel. (arXiv:2205.09342v1 [stat.ML])
    Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data, and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression using the manifold-Hilbert kernel is weakly consistent in the setting of Devroye et al. 1998. For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.  ( 2 min )
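    For reference, a sketch of the Euclidean Hilbert kernel regression estimate of Devroye et al. (1998), which the manifold-Hilbert construction generalizes to Riemannian data (notation assumed: data $(X_i, Y_i)$ in $\mathbb{R}^d$):

```latex
% Hilbert kernel regression estimate (Devroye et al., 1998):
\hat{m}_n(x) =
  \frac{\sum_{i=1}^{n} Y_i \,\lVert x - X_i \rVert^{-d}}
       {\sum_{i=1}^{n} \lVert x - X_i \rVert^{-d}}
% The kernel weight \lVert x - X_i \rVert^{-d} diverges at the data
% points, so the estimate interpolates the training data while remaining
% weakly consistent -- the property the paper transfers to the manifold
% setting.
```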
    SiReN: Sign-Aware Recommendation Using Graph Neural Networks. (arXiv:2108.08735v2 [cs.IR] UPDATED)
    In recent years, many recommender systems using network embedding (NE), such as graph neural networks (GNNs), have been extensively studied with the aim of improving recommendation accuracy. However, such attempts have focused mostly on utilizing only the information of positive user-item interactions with high ratings. Thus, there is a challenge in how to make use of low rating scores for representing users' preferences, since low ratings can still be informative in designing NE-based recommender systems. In this study, we present SiReN, a new sign-aware recommender system based on GNN models. Specifically, SiReN has three key components: 1) constructing a signed bipartite graph for more precisely representing users' preferences, which is split into two edge-disjoint graphs with positive and negative edges each, 2) generating two embeddings for the partitioned graphs with positive and negative edges via a GNN model and a multi-layer perceptron (MLP), respectively, and then using an attention model to obtain the final embeddings, and 3) establishing a sign-aware Bayesian personalized ranking (BPR) loss function in the process of optimization. Through comprehensive experiments, we empirically demonstrate that SiReN consistently outperforms state-of-the-art NE-aided recommendation methods.
    Semi-Supervised Learning for Image Classification using Compact Networks in the BioMedical Context. (arXiv:2205.09678v1 [cs.CV])
    The development of mobile and edge applications that embed deep convolutional neural models has the potential to revolutionise biomedicine. However, most deep learning models require computational resources that are not available in smartphones or edge devices; an issue that can be addressed by means of compact models. The problem with such models is that they are usually less accurate than bigger models. In this work, we study how this limitation can be addressed with the application of semi-supervised learning techniques. We conduct several statistical analyses to compare the performance of compact deep architectures trained using semi-supervised learning methods on image classification tasks in the biomedical context. In particular, we explore three families of compact networks and two families of semi-supervised learning techniques on 10 biomedical tasks. By combining semi-supervised learning methods with compact networks, it is possible to obtain a performance similar to that of standard-size networks. In general, the best results are obtained when combining data distillation with MixNet, and plain distillation with ResNet-18. Also, in general, NAS networks obtain better results than manually designed networks and quantized networks. The work presented in this paper shows the benefits of applying semi-supervised methods to compact networks: this allows us to create compact models that are not only as accurate as standard-size models, but also faster and lighter. Finally, we have developed a library that simplifies the construction of compact models using semi-supervised learning methods.
    A Graph Data Augmentation Strategy with Entropy Preservation. (arXiv:2107.06048v2 [cs.LG] UPDATED)
    The Graph Convolutional Network (GCN) proposed by Kipf and Welling is an effective model for semi-supervised learning, but faces the obstacle of over-smoothing, which weakens its representation ability. Recently, some works have been proposed to tackle this limitation by randomly perturbing the graph topology or feature matrix to generate data augmentations as input for training. However, these operations inevitably damage the integrity of information structures and sacrifice the smoothness of the feature manifold. In this paper, we first introduce a novel graph entropy definition as a measure to quantitatively evaluate the smoothness of a data manifold, and then point out that this graph entropy is controlled by triangle motif-based information structures. Considering the preservation of graph entropy, we propose an effective strategy to generate randomly perturbed training data while maintaining both the graph topology and the graph entropy. Extensive experiments have been conducted on real-world datasets, and the results verify the effectiveness of our proposed method in improving semi-supervised node classification accuracy compared with a wide range of baselines. Beyond that, our proposed approach can significantly enhance the robustness of the training process for GCN.
    RankGen: Improving Text Generation with Large Ranking Models. (arXiv:2205.09726v1 [cs.CL])
    Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues, we present RankGen, an encoder model (1.2B parameters) that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, which discourage topically similar but irrelevant generations; and (2) sequences generated from a large language model conditioned on the prefix, which discourage repetition and hallucination. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling on both automatic metrics (85.0 vs 77.3 MAUVE) as well as human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We open source our model checkpoints, code, and human preferences with detailed explanations for future research.
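    Only the scoring rule is sketched below; the encoders that produce the prefix and candidate embeddings stand in for the released RankGen checkpoints, and the variable names are assumptions:

```python
import torch

def rankgen_rerank(prefix_vec, candidate_vecs, candidates):
    """Pick the candidate continuation whose embedding has the largest
    dot product with the prefix embedding -- the way a contrastively
    trained scorer like RankGen can rerank beam-search candidates.
    """
    scores = candidate_vecs @ prefix_vec           # (num_candidates,)
    return candidates[int(torch.argmax(scores))]

# prefix_vec: (d,), candidate_vecs: (num_candidates, d), produced by the
# scorer's prefix and suffix encoders respectively.
```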
    Differentially private Riemannian optimization. (arXiv:2205.09494v1 [math.OC])
    In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework of differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space compatible to such generalized Gaussian distribution. We show that this strategy presents a simple analysis as compared to directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesic (strongly) convex, general nonconvex objectives as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.
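    A minimal sketch of one noisy step of the scheme, assuming the manifold is the unit sphere with its induced metric (where the intrinsically defined Gaussian reduces to isotropic noise projected onto the tangent space); step size, noise scale, and the retraction-by-normalization are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_riemannian_sgd_step(x, euclid_grad, lr=0.1, sigma=0.01):
    """One differentially private Riemannian SGD step on the unit sphere:
    project the gradient to the tangent space at x, add Gaussian noise
    *in the tangent space*, then retract back onto the manifold.
    """
    tangent_grad = euclid_grad - (euclid_grad @ x) * x   # tangent projection
    noise = rng.normal(scale=sigma, size=x.shape)
    noise -= (noise @ x) * x                             # keep the noise tangent
    step = x - lr * (tangent_grad + noise)
    return step / np.linalg.norm(step)                   # retraction to the sphere
```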
    AI-assisted Optimization of the ECCE Tracking System at the Electron Ion Collider. (arXiv:2205.09185v1 [physics.ins-det])
    The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory in approximately 10 years from now, with detector design and R&D currently ongoing. Notably, EIC is one of the first large-scale facilities to leverage Artificial Intelligence (AI) already starting from the design and R&D phases. The EIC Comprehensive Chromodynamics Experiment (ECCE) is a consortium that proposed a detector design based on a 1.5T solenoid. The EIC detector proposal review concluded that the ECCE design will serve as the reference design for an EIC detector. Herein we describe a comprehensive optimization of the ECCE tracker using AI. The work required a complex parametrization of the simulated detector system. Our approach dealt with an optimization problem in a multidimensional design space driven by multiple objectives that encode the detector performance, while satisfying several mechanical constraints. We describe our strategy and show results obtained for the ECCE tracking system. The AI-assisted design is agnostic to the simulation framework and can be extended to other sub-detectors or to a system of sub-detectors to further optimize the performance of the EIC detector.  ( 4 min )
    Fast matrix multiplication for binary and ternary CNNs on ARM CPU. (arXiv:2205.09120v1 [cs.LG])
    Low-bit quantized neural networks are of great interest in practical applications because they significantly reduce the consumption of both memory and computational resources. Binary neural networks are memory- and computationally efficient as they require only one bit per weight and activation and can be computed using Boolean logic and bit count operations. QNNs with ternary weights and activations and binary weights and ternary activations aim to improve recognition quality compared to BNNs while preserving a low bit-width. However, their efficient implementation is usually considered on ASICs and FPGAs, limiting their applicability in real-life tasks. At the same time, one of the areas where efficient recognition is most in demand is recognition on mobile devices using their CPUs. However, there are no known fast implementations of TBNs and TNNs, only the daBNN library for BNN inference. In this paper, we propose novel fast algorithms of ternary, ternary-binary, and binary matrix multiplication for mobile devices with the ARM architecture. In our algorithms, ternary weights are represented using a 2-bit encoding and binary weights using one bit. This allows us to replace matrix multiplication with Boolean logic operations that can be computed on 128 bits simultaneously, using the ARM NEON SIMD extension. The matrix multiplication results are accumulated in 16-bit integer registers. We also use a special reordering of values in the left and right matrices. All of this allows us to efficiently compute a matrix product while minimizing the number of loads and stores compared to the algorithm from daBNN. Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs. We evaluate them experimentally on an ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications.  ( 2 min )
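    The core arithmetic behind binary matrix multiplication can be sketched in a few lines; this pure-Python version only illustrates the XOR/popcount identity, not the NEON-vectorized implementation the paper develops:

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1}^n vectors packed as n-bit integers
    (bit 1 encodes +1). Since XOR marks disagreeing positions,
    dot = agreements - disagreements = n - 2 * popcount(a XOR b).
    """
    disagreements = bin(a_bits ^ b_bits).count("1")
    return n - 2 * disagreements

# (+1,+1,-1,-1) . (+1,-1,-1,+1) = 1 - 1 + 1 - 1 = 0
print(binary_dot(0b1100, 0b1001, 4))  # 0
```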
    A2C is a special case of PPO. (arXiv:2205.09123v1 [cs.LG])
    Advantage Actor-critic (A2C) and Proximal Policy Optimization (PPO) are popular deep reinforcement learning algorithms used for game AI in recent years. A common understanding is that A2C and PPO are separate algorithms because PPO's clipped objective appears significantly different than A2C's objective. In this paper, however, we show A2C is a special case of PPO. We present theoretical justifications and pseudocode analysis to demonstrate why. To validate our claim, we conduct an empirical experiment using \texttt{Stable-baselines3}, showing A2C and PPO produce the \textit{exact} same models when other settings are controlled.  ( 2 min )
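    The gist of the equivalence can be written down directly; the following is a sketch of the argument in PPO's standard notation:

```latex
% PPO's clipped objective, with probability ratio r_t:
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
  \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
% With a single update epoch per rollout, the gradient is evaluated at
% \theta = \theta_{old}, where r_t = 1 and the clip is inactive:
\nabla_\theta L^{\mathrm{CLIP}}\big|_{\theta=\theta_{\mathrm{old}}}
  = \mathbb{E}_t\!\left[\hat{A}_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],
% which is exactly the A2C policy-gradient objective.
```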
    LeRaC: Learning Rate Curriculum. (arXiv:2205.09180v1 [cs.LG])
    Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages the use of a different learning rate for each layer of a neural network to create a data-free curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on eight datasets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures, comparing our approach with the conventional training regime. Moreover, we also compare with Curriculum by Smoothing (CBS), a state-of-the-art data-free curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all datasets and models. Furthermore, we significantly surpass CBS in terms of training time (there is no additional cost over the standard training regime for LeRaC).  ( 2 min )
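    A minimal PyTorch sketch of the idea: layers closer to the input start at the target learning rate and deeper layers start lower, with every group warmed up to the shared value over the first iterations (the geometric decay and linear warm-up here are illustrative choices, not the paper's exact schedule):

```python
import torch

def lerac_param_groups(layers, eta_max=1e-3, decay=0.1):
    # Layers closer to the input get higher initial learning rates.
    return [{"params": layer.parameters(), "lr": eta_max * (decay ** i)}
            for i, layer in enumerate(layers)]

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
opt = torch.optim.SGD(lerac_param_groups([model[0], model[2]]), lr=1e-3)

def lerac_warmup(optimizer, step, warmup_steps, eta_max=1e-3, decay=0.1):
    # Raise every group's lr toward eta_max during the first iterations;
    # after warmup_steps, all groups share eta_max and training proceeds
    # as usual. `decay` must match the value used to build the groups.
    frac = min(1.0, step / warmup_steps)
    for i, group in enumerate(optimizer.param_groups):
        start = eta_max * (decay ** i)
        group["lr"] = start + frac * (eta_max - start)
```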
    Routing and Placement of Macros using Deep Reinforcement Learning. (arXiv:2205.09289v1 [cs.LG])
    Chip placement has long been one of the most time-consuming tasks in semiconductor design; because of this, many projects slip and chip availability in real markets gets delayed. An engineer placing macros on a chip also needs to place them so as to optimize three important factors: power, performance and time. Motivated by these problems, we introduce a new method using Reinforcement Learning, in which we train a model to place the nodes of a chip netlist onto a chip canvas. We aim to build a neural architecture that accurately rewards the agent across a wide variety of input netlists.  ( 2 min )
    On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning. (arXiv:2205.09121v1 [cs.LG])
    While first-order methods are popular for solving optimization problems that arise in large-scale deep learning problems, they come with some acute deficiencies. To diminish such shortcomings, there has been recent interest in applying second-order methods such as quasi-Newton methods, which construct Hessian approximations using only gradient information. The main focus of our work is to study the behaviour of stochastic quasi-Newton algorithms for training deep neural networks. We have analyzed the performance of two well-known quasi-Newton updates, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) update and the Symmetric Rank One (SR1) update. This study fills a gap concerning the real performance of both updates and analyzes whether more efficient training is obtained when using the more robust BFGS update or the cheaper SR1 formula, which allows for indefinite Hessian approximations and thus can potentially help to better navigate the pathological saddle points present in the non-convex loss functions found in deep learning. We present and discuss the results of an extensive experimental study which includes the effect of batch normalization and network architecture, the limited-memory parameter, the batch size, and the type of sampling strategy. We show that stochastic quasi-Newton optimizers are efficient and able to outperform in some instances the well-known first-order Adam optimizer, run with the optimal combination of its numerous hyperparameters.  ( 2 min )
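    For reference, with $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$, the two Hessian-approximation updates under study are the standard ones:

```latex
% BFGS: positive definite by construction (given the curvature condition):
B_{k+1} = B_k
  - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k}
  + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}
% SR1: a cheaper rank-one correction that may yield indefinite
% approximations, which is what lets it model saddle-point curvature:
B_{k+1} = B_k
  + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^{\top}}{(y_k - B_k s_k)^{\top} s_k}
```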
    BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video. (arXiv:2205.09382v1 [eess.IV])
    Predicting fetal weight at birth is an important aspect of perinatal care, particularly in the context of antenatal management, which includes the planned timing and the mode of delivery. Accurate prediction of weight using prenatal ultrasound is challenging as it requires images of specific fetal body parts during advanced pregnancy which is difficult to capture due to poor quality of images caused by the lack of amniotic fluid. As a consequence, predictions which rely on standard methods often suffer from significant errors. In this paper we propose the Residual Transformer Module which extends a 3D ResNet-based network for analysis of 2D+t spatio-temporal ultrasound video scans. Our end-to-end method, called BabyNet, automatically predicts fetal birth weight based on fetal ultrasound video scans. We evaluate BabyNet using a dedicated clinical set comprising 225 2D fetal ultrasound videos of pregnancies from 75 patients performed one day prior to delivery. Experimental results show that BabyNet outperforms several state-of-the-art methods and estimates the weight at birth with accuracy comparable to human experts. Furthermore, combining estimates provided by human experts with those computed by BabyNet yields the best results, outperforming either of other methods by a significant margin. The source code of BabyNet is available at https://github.com/SanoScience/BabyNet.  ( 2 min )
    Stochastic uncertainty analysis of gravity gradient tensor components and their combinations. (arXiv:2205.09159v1 [physics.geo-ph])
    Full tensor gravity (FTG) devices provide up to five independent components of the gravity gradient tensor. However, we do not yet have a quantitative understanding of which tensor components or combinations of components are more important for recovering a subsurface density model by gravity inversion. This is mainly because different components may be more appropriate in different scenarios or for different purposes. Knowledge of the behaviour of these components in different environments can aid the selection of optimal component combinations. In this work, we propose to apply stochastic inversion to assess the uncertainty of gravity gradient tensor components and their combinations. The method is therefore a quantitative approach. The method applied here is based on geostatistical inversion (Gaussian process regression) using cokriging. The cokriging variances (the variance function of the GP) are found to be a useful indicator for distinguishing the gravity gradient tensor components. This approach is applied to the New Found dataset to demonstrate its effectiveness in real-world applications.  ( 2 min )
    A False Sense of Security? Revisiting the State of Machine Learning-Based Industrial Intrusion Detection. (arXiv:2205.09199v1 [cs.CR])
    Anomaly-based intrusion detection promises to detect novel or unknown attacks on industrial control systems by modeling expected system behavior and raising corresponding alarms for any deviations. As manually creating these behavioral models is tedious and error-prone, research focuses on machine learning to train them automatically, achieving detection rates upwards of 99%. However, these approaches are typically trained not only on benign traffic but also on attacks, and then evaluated against the same type of attack used for training. Hence, their actual, real-world performance on unknown (not trained on) attacks remains unclear. In turn, the reported near-perfect detection rates of machine learning-based intrusion detection might create a false sense of security. To assess this situation and clarify the real potential of machine learning-based industrial intrusion detection, we develop an evaluation methodology and examine multiple approaches from the literature for their performance on unknown attacks (excluded from training). Our results highlight an ineffectiveness in detecting unknown attacks, with detection rates dropping to between 3.2% and 14.7% for some types of attacks. Moving forward, we derive recommendations for further research on machine learning-based approaches to ensure clarity on their ability to detect unknown attacks.  ( 2 min )
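    The evaluation methodology can be sketched as a leave-one-attack-type-out loop; `detector_cls` and its fit/detection-rate interface are hypothetical placeholders for the approaches examined in the paper:

```python
def leave_one_attack_out(detector_cls, benign, attacks_by_type):
    """Measure performance on *unknown* attacks: for each attack type,
    train on benign traffic plus all other attack types, then report
    the detection rate on the held-out type.
    """
    results = {}
    for held_out, samples in attacks_by_type.items():
        train_attacks = [x for t, xs in attacks_by_type.items()
                         if t != held_out for x in xs]
        detector = detector_cls()
        detector.fit(benign, train_attacks)
        results[held_out] = detector.detection_rate(samples)
    return results
```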
    DDXPlus: A new Dataset for Medical Automatic Diagnosis. (arXiv:2205.09148v1 [cs.CL])
    There has been rapidly growing interest in Automatic Diagnosis (AD) and Automatic Symptom Detection (ASD) systems in the machine learning research literature, aiming to assist doctors in telemedicine services. These systems are designed to interact with patients, collect evidence relevant to their concerns, and make predictions about the underlying diseases. Doctors would review the interaction, including the evidence and the predictions, before making their final decisions. Despite the recent progress, an important piece of doctors' interactions with patients is missing in the design of AD and ASD systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset that includes a differential diagnosis, along with the ground-truth pathology, for each patient. In addition, this dataset includes more pathologies, as well as more types of symptoms and antecedents. As a proof of concept, we extend several existing AD and ASD systems to incorporate differential diagnosis, and provide empirical evidence that using differentials in training signals is essential for such systems to learn to predict differentials. Dataset available at https://github.com/bruzwen/ddxplus  ( 2 min )
    Computing the ensemble spread from deterministic weather predictions using conditional generative adversarial networks. (arXiv:2205.09182v1 [cs.LG])
    Ensemble prediction systems are an invaluable tool for weather forecasting. Practically, ensemble predictions are obtained by running several perturbations of the deterministic control forecast. However, ensemble prediction is associated with a high computational cost and often involves statistical post-processing steps to improve its quality. Here we propose to use deep-learning-based algorithms to learn the statistical properties of an ensemble prediction system, the ensemble spread, given only the deterministic control forecast. Thus, once trained, the costly ensemble prediction system will not be needed anymore to obtain future ensemble forecasts, and the statistical properties of the ensemble can be derived from a single deterministic forecast. We adapt the classical pix2pix architecture to a three-dimensional model and also experiment with a shared latent space encoder-decoder model, and train them against several years of operational (ensemble) weather forecasts for the 500 hPa geopotential height. The results demonstrate that the trained models indeed allow obtaining a highly accurate ensemble spread from the control forecast only.  ( 2 min )
    MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes. (arXiv:2205.09248v1 [cs.SD])
    We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.  ( 2 min )
    Transformer-based Program Synthesis for Low-Data Environments. (arXiv:2205.09246v1 [cs.PL])
    Recent advancements in large pre-trained transformer models (GPT2/3, T5) have found use in program synthesis to generate programs that satisfy a set of input/output examples. However, these models perform poorly on long-horizon and low-data tasks, and often don't seem to understand the semantics of the languages they generate. We investigate an approach that tackles both of these issues, by using attributed context-free grammars of programming languages to generate programs, and then analyzing generated programs so that they can be annotated with compile-time and runtime attributes, such as types, so that information about the program can be remembered during long-horizon generation. We first find that synthesized datasets can be made efficiently and can provide transformer models with enough data to perform well on some synthesis tasks. We also find that giving models access to program attributes is especially effective in low-data environments, and tends to improve the quality and reduce the errors of transformer-generated programs.  ( 2 min )
    Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research. (arXiv:2205.09114v1 [cond-mat.quant-gas])
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33 % of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    Hybrid Machine Learning Modeling of Engineering Systems -- A Probabilistic Perspective Tested on a Multiphase Flow Modeling Case Study. (arXiv:2205.09196v1 [cs.LG])
    To operate process engineering systems in a safe and reliable manner, predictive models are often used in decision making. In many cases, these are mechanistic first principles models which aim to accurately describe the process. In practice, the parameters of these models need to be tuned to the process conditions at hand. If the conditions change, which is common in practice, the model becomes inaccurate and needs to be re-tuned. In this paper, we propose a hybrid modeling machine learning framework that allows tuning first principles models to process conditions using two different types of Bayesian Neural Networks. Our approach not only estimates the expected values of the first principles model parameters but also quantifies the uncertainty of these estimates. Such an approach to hybrid machine learning modeling is not yet well described in the literature, so we believe this paper will provide an additional angle from which hybrid machine learning modeling of physical systems can be considered. As an example, we choose a multiphase pipe flow process for which we constructed a three-phase steady state model based on the drift-flux approach, which can be used for modeling of pipe and well flow behavior in oil and gas production systems with or without the neural network tuning. In the simulation results, we show how uncertainty estimates of the resulting hybrid models can be used to make better operation decisions.  ( 2 min )
    PreQuEL: Quality Estimation of Machine Translation Outputs in Advance. (arXiv:2205.09178v1 [cs.CL])
    We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-art. From a theoretical perspective, PreQuEL places the focus on the source text, tracing properties, possibly linguistic features, that make a sentence harder to machine translate. We develop a baseline model for the task and analyze its performance. We also develop a data augmentation method (from parallel corpora), that improves results substantially. We show that this augmentation method can improve the performance of the Quality-Estimation task as well. We investigate the properties of the input text that our model is sensitive to, by testing it on challenge sets and different languages. We conclude that it is aware of syntactic and semantic distinctions, and correlates and even over-emphasizes the importance of standard NLP features.  ( 2 min )
    High-Order Multilinear Discriminant Analysis via Order-$\textit{n}$ Tensor Eigendecomposition. (arXiv:2205.09191v1 [cs.LG])
    Higher-order data with high dimensionality is of immense importance in many areas of machine learning, computer vision, and video analytics. Multidimensional arrays (commonly referred to as tensors) are used for arranging higher-order data structures while keeping the natural representation of the data samples. In the past decade, great efforts have been made to extend the classic linear discriminant analysis for higher-order data classification generally referred to as multilinear discriminant analysis (MDA). Most of the existing approaches are based on the Tucker decomposition and $\textit{n}$-mode tensor-matrix products. The current paper presents a new approach to tensor-based multilinear discriminant analysis referred to as High-Order Multilinear Discriminant Analysis (HOMLDA). This approach is based upon the tensor decomposition where an order-$\textit{n}$ tensor can be written as a product of order-$\textit{n}$ tensors and has a natural extension to traditional linear discriminant analysis (LDA). Furthermore, the resulting framework, HOMLDA, might produce a within-class scatter tensor that is close to singular. Thus, computing the inverse inaccurately may distort the discriminant analysis. To address this problem, an improved method referred to as Robust High-Order Multilinear Discriminant Analysis (RHOMLDA) is introduced. Experimental results on multiple data sets illustrate that our proposed approach provides improved classification performance with respect to the current Tucker decomposition-based supervised learning methods.  ( 2 min )
    FedILC: Weighted Geometric Mean and Invariant Gradient Covariance for Federated Learning on Non-IID Data. (arXiv:2205.09305v1 [cs.LG])
    Federated learning is a distributed machine learning approach which enables a shared server model to learn by aggregating the locally-computed parameter updates with the training data from spatially-distributed client silos. Though successfully possessing advantages in both scale and privacy, federated learning is hurt by domain shift problems, where the learning models are unable to generalize to unseen domains whose data distribution is non-i.i.d. with respect to the training domains. In this study, we propose the Federated Invariant Learning Consistency (FedILC) approach, which leverages the gradient covariance and the geometric mean of Hessians to capture both inter-silo and intra-silo consistencies of environments and unravel the domain shift problems in federated networks. The benchmark and real-world dataset experiments bring evidence that our proposed algorithm outperforms conventional baselines and similar federated learning algorithms. This is relevant to various fields such as medical healthcare, computer vision, and the Internet of Things (IoT). The code is released at https://github.com/mikemikezhu/FedILC.  ( 2 min )
    AutoQML: Automated Quantum Machine Learning for Wi-Fi Integrated Sensing and Communications. (arXiv:2205.09115v1 [cs.LG])
    Commercial Wi-Fi devices can be used for integrated sensing and communications (ISAC) to jointly exchange data and monitor the indoor environment. In this paper, we investigate a proof-of-concept approach using an automated quantum machine learning (AutoQML) framework called AutoAnsatz to recognize human gestures. We address how to efficiently design quantum circuits to configure quantum neural networks (QNN). The effectiveness of AutoQML is validated by an in-house experiment for human pose recognition, achieving state-of-the-art performance of greater than 80% accuracy for a limited data size with a significantly small number of trainable parameters.  ( 2 min )
    Mitigating Neural Network Overconfidence with Logit Normalization. (arXiv:2205.09310v1 [cs.LG])
    Detecting out-of-distribution inputs is critical for safe deployment of machine learning models in the real world. However, neural networks are known to suffer from the overconfidence issue, where they produce abnormally high confidence for both in- and out-of-distribution inputs. In this work, we show that this issue can be mitigated through Logit Normalization (LogitNorm) -- a simple fix to the cross-entropy loss -- by enforcing a constant vector norm on the logits in training. Our method is motivated by the analysis that the norm of the logit keeps increasing during training, leading to overconfident output. Our key idea behind LogitNorm is thus to decouple the influence of output's norm during network optimization. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data. Extensive experiments demonstrate the superiority of LogitNorm, reducing the average FPR95 by up to 42.30% on common benchmarks.  ( 2 min )
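    The fix is small enough to sketch directly; this follows the abstract's description (cross-entropy on norm-constrained logits), with the temperature value being an assumption:

```python
import torch
import torch.nn.functional as F

def logitnorm_loss(logits, targets, tau=0.04):
    """Cross-entropy computed on L2-normalized logits. Keeping the logit
    norm constant during training removes the ever-growing-norm effect
    identified as the driver of overconfidence.
    (tau is a temperature hyperparameter; 0.04 is an assumed value.)
    """
    norms = logits.norm(p=2, dim=-1, keepdim=True).clamp_min(1e-7)
    return F.cross_entropy(logits / (norms * tau), targets)

# Drop-in replacement for F.cross_entropy(logits, targets) in training.
```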
  • Open

    Approximating Persistent Homology for Large Datasets. (arXiv:2204.09155v2 [stat.ML] UPDATED)
    Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying smaller multiple subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
    Design choice and machine learning model performances. (arXiv:2201.10239v2 [stat.ML] UPDATED)
    An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and model for data analysis is often not driven by statistical or algorithmic advantages, thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This article discusses the choice of design in relation to the ML model performances. A study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    High-dimensional regression with potential prior information on variable importance. (arXiv:2109.11281v2 [stat.ME] UPDATED)
    There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and time series settings. An R package is available on github.
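    A naive version of the proposed scheme fits the nested sequence indicated by the ordering and selects by cross-validation; the loop below refits each model from scratch for clarity, whereas the paper shows the whole ridge sequence costs no more than a single fit:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def ordered_model_sequence(X, y, order, alpha=1.0):
    """Fit ridge models on the first k variables of a prior ordering,
    for k = 1..d, and return the variable subset whose model scores
    best under cross-validation.
    """
    scores = []
    for k in range(1, len(order) + 1):
        cols = order[:k]
        scores.append(cross_val_score(Ridge(alpha=alpha), X[:, cols], y).mean())
    best_k = int(np.argmax(scores)) + 1
    return order[:best_k]

# Example ordering: variables sorted by decreasing empirical variance,
# one of the vague prior orderings the abstract mentions.
```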
    Posterior Matching for Arbitrary Conditioning. (arXiv:2201.12414v3 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underly some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g.~discrete, hierarchical, VaDE).
    Spiked Covariance Estimation from Modulo-Reduced Measurements. (arXiv:2110.01150v3 [cs.IT] UPDATED)
    Consider the rank-1 spiked model: $\bf{X}=\sqrt{\nu}\xi \bf{u}+ \bf{Z}$, where $\nu$ is the spike intensity, $\bf{u}\in\mathbb{S}^{k-1}$ is an unknown direction and $\xi\sim \mathcal{N}(0,1),\bf{Z}\sim \mathcal{N}(\bf{0},\bf{I})$. Motivated by recent advances in analog-to-digital conversion, we study the problem of recovering $\bf{u}\in \mathbb{S}^{k-1}$ from $n$ i.i.d. modulo-reduced measurements $\bf{Y}=[\bf{X}]\mod \Delta$, focusing on the high-dimensional regime ($k\gg 1$). We develop and analyze an algorithm that, for most directions $\bf{u}$ and $\nu=\mathrm{poly}(k)$, estimates $\bf{u}$ to high accuracy using $n=\mathrm{poly}(k)$ measurements, provided that $\Delta\gtrsim \sqrt{\log k}$. Up to constants, our algorithm accurately estimates $\bf{u}$ at the smallest possible $\Delta$ that allows (in an information-theoretic sense) to recover $\bf{X}$ from $\bf{Y}$. A key step in our analysis involves estimating the probability that a line segment of length $\approx\sqrt{\nu}$ in a random direction $\bf{u}$ passes near a point in the lattice $\Delta \mathbb{Z}^k$. Numerical experiments show that the developed algorithm performs well even in a non-asymptotic setting.
    Turbulent field fluctuations in gyrokinetic and fluid plasmas. (arXiv:2107.09744v2 [physics.plasm-ph] CROSS LISTED)
    A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is validating the accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades been used to simulate boundary plasmas in experiments, but significant questions remain regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first-ever direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling, with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.
    Overcoming challenges in leveraging GANs for few-shot data augmentation. (arXiv:2203.16662v2 [stat.ML] UPDATED)
    In this paper, we explore the use of GAN-based few-shot data augmentation as a method to improve few-shot classification performance. We explore how a GAN can be fine-tuned for such a task (including in a class-incremental manner), and rigorously and empirically investigate how well these models can improve few-shot classification. We identify issues related to the difficulty of training such generative models under a purely supervised regime with very few examples, as well as issues regarding the evaluation protocols of existing works. We also find that in this regime, classification accuracy is highly sensitive to how the classes of the dataset are randomly split. Therefore, we propose a semi-supervised fine-tuning approach as a more pragmatic way forward to address these problems.
    SEMI: Self-supervised Exploration via Multisensory Incongruity. (arXiv:2009.12494v2 [cs.LG] UPDATED)
    Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy by incentivizing the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and it improves sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
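    A plausible formalization of the combined intrinsic reward, with $g_\phi$ the alignment predictor, $\pi$ the policy, and $\lambda$ a hypothetical weighting (the paper's exact form may differ), is $r^{\text{int}}_t = \|g_\phi(o^1_t, \dots, o^M_t) - y_t\| + \lambda\,\mathrm{Var}_m\big[\pi(o^m_t)\big]$, where the first term scores misalignment of the multisensory inputs and the second is the variance of the actions proposed under different sensory combinations.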
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v2 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists of a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence at an exponential rate to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single-cell RNA-sequencing data, where cells can branch and die.
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v2 [stat.ML] UPDATED)
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show through two public benchmark datasets that the resulting model improves forecasting performance.
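    As a sketch of the resulting generative structure (our notation, not the paper's exact parameterization): the transition stays non-linear while the decoder is linear, $\mathbf{z}_t = f_\theta(\mathbf{z}_{t-1}) + \boldsymbol{\eta}_t$ and $\mathbf{y}_t = \mathbf{B}\mathbf{z}_t + \boldsymbol{\varepsilon}_t$, with shrinkage priors on the latent variables, so that $\mathbf{z}_t$ can be read as random effects in a linear mixed model.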
    Foundation Posteriors for Approximate Probabilistic Inference. (arXiv:2205.09735v1 [cs.LG])
    Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a ``foundation'' posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.
    Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v4 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. Our first result is a new universal lower bound on the number of single-node interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of single-node interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better. To prove our lower bound, we develop the notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. We also use the techniques developed here to extend our results to the setting of multi-node interventions.
    M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation. (arXiv:2112.07574v2 [cs.LG] UPDATED)
    This work proposes M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines on three synthetic benchmark datasets: two with multiple treatments and one with a single treatment. Our analysis showed that our method has superior performance, yielding more accurate estimates of the multiple treatment effects.
    Bayesian Network Structure Learning using Digital Annealer. (arXiv:2006.06926v3 [cs.LG] UPDATED)
    Annealing processors, which solve a quadratic unconstrained binary optimization (QUBO), are a potential breakthrough in improving the accuracy of score-based Bayesian network structure learning. However, currently, the bit capacity of an annealing processor is very limited. To utilize the power of annealing processors, it is necessary to encode score-based learning problems into QUBO within the upper bound of bits. In this paper, we propose a novel approach based on the decomposition of candidate parent sets. Experimental results on benchmark networks with $37$ to $223$ variables show that our approach requires fewer bits than the bit capacity of the fourth-generation Fujitsu Digital Annealer, a fully coupled annealing processor developed with semiconductor technology. Moreover, we demonstrate that the Digital Annealer with our conversion method outperforms existing algorithms on some benchmark networks. We expect our approach to promote the utility of annealing processors in learning Bayesian networks.
    Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime. (arXiv:1806.04884v3 [stat.ML] UPDATED)
    In this paper, we theoretically prove that deep ReLU neural networks do not encounter spurious local minima in the loss landscape under the Neural Tangent Kernel (NTK) regime, that is, in the gradient descent training dynamics of deep ReLU neural networks whose parameters are initialized by a normal distribution, in the limit as the widths of the hidden layers tend to infinity.  ( 2 min )
    Flexible Modeling and Multitask Learning using Differentiable Tree Ensembles. (arXiv:2205.09717v1 [cs.LG])
    Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single-task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those produced by popular toolkits.  ( 2 min )
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v4 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows one to express complex computations by composing elementary ones in creative ways, and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention, with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult for practitioners to use, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient, as it can be added on top of any state-of-the-art solver, and modular, as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow one to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.  ( 2 min )
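    The mechanism can be illustrated with a self-contained JAX sketch (ours, not the paper's library code): the user writes the optimality condition $F$, and the implicit function theorem converts the solver's output into gradients. The inner problem here is a toy ridge regression with penalty theta.

        import jax
        import jax.numpy as jnp

        A = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        b = jnp.array([1.0, 2.0, 3.0])

        def F(w, theta):
            # Optimality condition of the inner problem: the gradient in w vanishes.
            return A.T @ (A @ w - b) + theta * w

        def solve(theta):
            # Closed-form inner solver; stands in for any black-box solver.
            return jnp.linalg.solve(A.T @ A + theta * jnp.eye(2), A.T @ b)

        def implicit_grad(theta):
            # Implicit function theorem: dw*/dtheta = -(dF/dw)^{-1} dF/dtheta.
            w_star = solve(theta)
            dF_dw = jax.jacobian(F, argnums=0)(w_star, theta)
            dF_dtheta = jax.jacobian(F, argnums=1)(w_star, theta)
            return -jnp.linalg.solve(dF_dw, dF_dtheta)

        print(implicit_grad(0.1))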
    Spherical Perspective on Learning with Normalization Layers. (arXiv:2006.13382v3 [cs.LG] UPDATED)
    Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, allows the optimization steps to be translated onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. First, the effective learning rate of Adam is derived for the first time. Then it is shown within this framework that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, the analysis outlines phenomena that previous variants of Adam act on, and their importance in the optimization process is experimentally validated.  ( 2 min )
    The Franz-Parisi Criterion and Computational Trade-offs in High Dimensional Statistics. (arXiv:2205.09727v1 [math.ST])
    Many high-dimensional statistical inference problems are believed to possess inherent computational hardness. Various frameworks have been proposed to give rigorous evidence for such hardness, including lower bounds against restricted models of computation (such as low-degree functions), as well as methods rooted in statistical physics that are based on free energy landscapes. This paper aims to make a rigorous connection between the seemingly different low-degree and free-energy based approaches. We define a free-energy based criterion for hardness and formally connect it to the well-established notion of low-degree hardness for a broad class of statistical problems, namely all Gaussian additive models and certain models with a sparse planted signal. By leveraging these rigorous connections we are able to: establish that for Gaussian additive models the "algebraic" notion of low-degree hardness implies failure of "geometric" local MCMC algorithms, and provide new low-degree lower bounds for sparse linear regression which seem difficult to prove directly. These results provide both conceptual insights into the connections between different notions of hardness, as well as concrete technical tools such as new methods for proving low-degree lower bounds.  ( 2 min )
    Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. (arXiv:2205.09653v1 [stat.ML])
    We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength is preserved across different widths on a CIFAR classification task.  ( 2 min )
    Disentangling Active and Passive Cosponsorship in the U.S. Congress. (arXiv:2205.09674v1 [cs.LG])
    In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.  ( 2 min )
    Augmented Lagrangian Methods for Time-varying Constrained Online Convex Optimization. (arXiv:2205.09571v1 [math.OC])
    In this paper, we consider online convex optimization (OCO) with time-varying loss and constraint functions. Specifically, the decision maker chooses sequential decisions based only on past information, while the loss and constraint functions are revealed over time. We first develop a class of model-based augmented Lagrangian methods (MALM) for time-varying functional constrained OCO (without feedback delay). Under standard assumptions, we establish sublinear regret and sublinear constraint violation of MALM. Furthermore, we extend MALM to deal with time-varying functional constrained OCO with delayed feedback, in which the feedback information of the loss and constraint functions is revealed to the decision maker with delays. Without additional assumptions, we also establish sublinear regret and sublinear constraint violation for the delayed version of MALM. Finally, numerical results for several examples of constrained OCO, including online network resource allocation, online logistic regression, and online quadratically constrained quadratic programs, are presented to demonstrate the efficiency of the proposed algorithms.  ( 2 min )
    Variational Inference for Bayesian Bridge Regression. (arXiv:2205.09515v1 [stat.ML])
    We study the implementation of Automatic Differentiation Variational Inference (ADVI) for Bayesian inference on regression models with bridge penalization. The bridge approach uses the $\ell_{\alpha}$ norm, with $\alpha \in (0, +\infty)$, to define a penalization on large values of the regression coefficients, and includes the Lasso ($\alpha = 1$) and ridge ($\alpha = 2$) penalizations as special cases. Full Bayesian inference seamlessly provides joint uncertainty estimates for all model parameters. Although MCMC approaches are available for bridge regression, they can be slow for large datasets, especially in high dimensions. The ADVI implementation allows the use of small batches of data at each iteration (due to stochastic gradient-based algorithms), therefore speeding up computation in comparison with MCMC. We illustrate the approach on non-parametric regression models with B-splines, although the method works seamlessly for other choices of basis functions. A simulation study shows the main properties of the proposed method.  ( 2 min )
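    For reference, the standard bridge-penalized objective is $\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{\alpha}$, corresponding in the Bayesian formulation to the prior $p(\boldsymbol\beta) \propto \exp(-\lambda \sum_j |\beta_j|^{\alpha})$; $\alpha = 1$ recovers the Lasso penalty and $\alpha = 2$ the ridge penalty.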
    Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel. (arXiv:2205.09342v1 [stat.ML])
    Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data, and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression using the manifold-Hilbert kernel is weakly consistent in the setting of Devroye et al. 1998. For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.  ( 2 min )
    Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. (arXiv:2205.09546v1 [stat.ML])
    In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for any regularization terms. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as Autoencoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in terms of log-likelihood, sample quality and denoising performance. In a broad sense, the main ambition of this work is to close the gap between the normalizing flow and autoencoder literature under the common framework of invertibility and exact maximum likelihood.  ( 2 min )
    Differentially private Riemannian optimization. (arXiv:2205.09494v1 [math.OC])
    In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework of differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space, compatible with such a generalized Gaussian distribution. We show that this strategy presents a simple analysis as compared to directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesic (strongly) convex and general nonconvex objectives, as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.  ( 2 min )
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v1 [cs.LG])
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyperparameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyperparameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, a numerical example is provided to explore the advantages of the super approximation power of ReLU NestNets.  ( 2 min )
    Robust Deep Neural Network Estimation for Multi-dimensional Functional Data. (arXiv:2205.09604v1 [stat.ME])
    In this paper, we propose a robust estimator for the location function from multi-dimensional functional data. The proposed estimators are based on deep neural networks with the ReLU activation function. At the same time, the estimators are less susceptible to outlying observations and model misspecification. For any multi-dimensional functional data, we provide uniform convergence rates for the proposed robust deep neural network estimators. Simulation studies illustrate the competitive performance of the robust deep neural network estimators on regular data and their superior performance on data that contain anomalies. The proposed method is also applied to analyze 2D and 3D images of patients with Alzheimer's disease obtained from the Alzheimer's Disease Neuroimaging Initiative database.  ( 2 min )
    Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research. (arXiv:2205.09114v1 [cond-mat.quant-gas])
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33 % of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v1 [cs.LG])
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using backpropagation. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.  ( 2 min )
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v1 [cs.LG])
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.  ( 2 min )
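    For context, the original (bilinear) Fenchel-Young loss generated by a regularizer $\Omega$ is $L_{\Omega}(\theta, y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle \ge 0$, which vanishes exactly when $y$ is an optimal prediction for $\theta$; the generalization described above replaces the bilinear pairing $\langle \theta, y \rangle$ with a general energy function.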
    Smooth densities and generative modeling with unsupervised random forests. (arXiv:2205.09435v1 [stat.ML])
    Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.  ( 2 min )
    scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data. (arXiv:2205.09523v1 [stat.ML])
    Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated by these technologies tend to have high levels of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of the data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.  ( 2 min )
    Continuously-Tempered PDMP Samplers. (arXiv:2205.09559v1 [stat.ME])
    New sampling algorithms based on simulating continuous-time stochastic processes called piece-wise deterministic Markov processes (PDMPs) have shown considerable promise. However, these methods can struggle to sample from multi-modal or heavy-tailed distributions. We show how tempering ideas can improve the mixing of PDMPs in such cases. We introduce an extended distribution defined over the state of the posterior distribution and an inverse temperature, which interpolates between a tractable distribution when the inverse temperature is 0 and the posterior when the inverse temperature is 1. The marginal distribution of the inverse temperature is a mixture of a continuous distribution on [0,1) and a point mass at 1, which means that we obtain draws from the posterior when the inverse temperature is 1, while the sampler also explores distributions at lower temperatures, improving mixing. We show how PDMPs, and particularly the Zig-Zag sampler, can be implemented to sample from such an extended distribution. The resulting algorithm is easy to implement and we show empirically that it can outperform existing PDMP-based samplers on challenging multimodal posteriors.  ( 2 min )
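    A standard geometric tempering path consistent with this description (the paper's exact construction may differ) is $\pi(x, \beta) \propto \pi_0(x)^{1-\beta} \pi_1(x)^{\beta} p(\beta)$ for $\beta \in [0,1]$, where $\pi_0$ is the tractable reference, $\pi_1$ the posterior, and $p(\beta)$ mixes a continuous density on $[0,1)$ with a point mass at $\beta = 1$.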
    Inferring extended summary causal graphs from observational time series. (arXiv:2205.09422v1 [cs.AI])
    This study addresses the problem of learning an extended summary causal graph on time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behavior of our methods is illustrated through several experiments run on simulated and real datasets.  ( 2 min )
    Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns. (arXiv:2205.09390v1 [stat.ML])
    Rapid advances in sensing, wireless communication, cloud computing and data science have brought unprecedented amounts of data to assist transportation engineers and researchers in making better decisions. However, traffic data in reality often has corrupted or incomplete values due to detector and communication malfunctions. Data imputation is thus required to ensure the effectiveness of downstream data-driven applications. To this end, numerous tensor-based methods treating the imputation problem as low-rank tensor completion (LRTC) have been attempted in previous works. To tackle rank minimization, which is at the core of LRTC, most of the aforementioned methods utilize the tensor nuclear norm (NN) as a convex surrogate for the minimization. However, the over-relaxation issue of NN prevents it from achieving desirable performance in practice. In this paper, we define an innovative nonconvex truncated Schatten p-norm for tensors (TSpN) to approximate tensor rank and impute missing spatiotemporal traffic data under the LRTC framework. We model traffic data as a third-order tensor of (time intervals, locations (sensors), days) and introduce four complicated missing patterns, including random missing and three fiber-like missing cases defined along the tensor mode-n fibers. Despite the nonconvexity of the objective function in our model, we derive global optimal solutions by integrating the alternating direction method of multipliers (ADMM) with generalized soft-thresholding (GST). In addition, we design a truncation rate decay strategy to deal with varying missing-rate scenarios. Comprehensive experiments are finally conducted using real-world spatiotemporal datasets, which demonstrate that the proposed LRTC-TSpN method performs well under various missing cases, while outperforming other SOTA tensor-based imputation models in almost all scenarios.  ( 2 min )
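    As a reference point for the shrinkage step, here is a minimal Python sketch of singular-value soft-thresholding, the $p=1$ (nuclear norm) special case of generalized soft-thresholding; for general Schatten-$p$ the scalar shrinkage subproblem is solved iteratively.

        import numpy as np

        def svt(M, tau):
            # Proximal operator of the nuclear norm: soft-threshold the singular values.
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt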
    Hybrid Machine Learning Modeling of Engineering Systems -- A Probabilistic Perspective Tested on a Multiphase Flow Modeling Case Study. (arXiv:2205.09196v1 [cs.LG])
    To operate process engineering systems in a safe and reliable manner, predictive models are often used in decision making. In many cases, these are mechanistic first-principles models which aim to accurately describe the process. In practice, the parameters of these models need to be tuned to the process conditions at hand. If the conditions change, which is common in practice, the model becomes inaccurate and needs to be re-tuned. In this paper, we propose a hybrid machine learning modeling framework that allows tuning first-principles models to process conditions using two different types of Bayesian Neural Networks. Our approach not only estimates the expected values of the first-principles model parameters but also quantifies the uncertainty of these estimates. Such an approach to hybrid machine learning modeling is not yet well described in the literature, so we believe this paper will provide an additional angle from which hybrid machine learning modeling of physical systems can be considered. As an example, we choose a multiphase pipe flow process, for which we constructed a three-phase steady-state model based on the drift-flux approach, which can be used for modeling pipe and well flow behavior in oil and gas production systems with or without the neural network tuning. In the simulation results, we show how uncertainty estimates of the resulting hybrid models can be used to make better operating decisions.  ( 2 min )
    Causal Inference from Small High-dimensional Datasets. (arXiv:2205.09281v1 [cs.LG])
    Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.  ( 2 min )
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v1 [stat.ML])
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply to only pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.  ( 2 min )
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v1 [cs.LG])
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender- and receiver-specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and the other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks, including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretability compared to existing models.  ( 2 min )
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v1 [cs.LG])
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with respect to a group $G$, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of "$G$-invariant neural architecture design": What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or "shallow" neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$. The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons. The classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. Based on a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. We draw the network morphisms between the enumerated architectures that can be leveraged during neural architecture search (NAS). Finally, we prove that architectures corresponding to inequivalent cohomology classes in a given cohomology ring coincide in function space only when their weight matrices are zero, and we discuss the implications of this in the context of NAS.  ( 2 min )

  • Open

    data preprocessing before or after splitting
    It could be a very newbie question. I have a dataset with financial values for companies; some features have missing values and the data is unbalanced. Here's what I did: (1) scaled the data with sklearn's StandardScaler; (2) imputed missing values with KNNImputer and IterativeImputer, trying several strategies and choosing the best one; (3) oversampled with SMOTE; (4) tested with a random forest classifier. The next step is to add more features: (1) by calculating some features from my already existing features and (2) by merging a microeconomic dataset, then trying dimensionality reduction techniques and comparing results on XGBoost, SVM, linear regression, an ANN, etc. What do you think, am I on the right track? Are the steps right? What should I do or improve? Many people told me that I have to split the data before preprocessing because of the data-leakage risk. What I'm doing right now is trying many things on my data before creating the final dataset on which I'll try all models. Note: applying the preprocessing techniques improved the F1 score from 0.2 to 0.78. I'm using cross-validation, btw. submitted by /u/YeccAnon4 [link] [comments]  ( 1 min )
    Smooth & Realistic Animation / Disco Diffusion v5.2
    submitted by /u/nalr00n [link] [comments]
    Is dall-e 2 really that good?
    submitted by /u/TheblackRook3 [link] [comments]
    AI Dream 53 - Cosmic Creation | MASTERPIECE TEASER
    submitted by /u/LordPewPew777 [link] [comments]
    Why are AI models that require training data not seen as a viable route to AGI?
    It's the most common criticism I've seen of Gato. Why can an AGI not emerge from a model that requires training? What is stopping a trained model from eventually learning how to improve itself without input? submitted by /u/helplesshell [link] [comments]  ( 1 min )
    Is there an AI that creates text containing certain words? I need a text with certain words in it, in the exact order I enter them
    I would like to create texts that I can then use to shoot a TikTok (this kind of TikTok: https://www.youtube.com/watch?v=QEmL-zPBiKs&t=11s), and I don't want to search for texts but create them with an AI right away. Is it possible? submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    OpenAI: DALL-E 2 passes the "Turing Test for vacation photos"
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    How do you get engineers and moral philosophers to work together to build ethical AI? Answers provided in new paper.
    https://link.springer.com/article/10.1007/s11948-022-00378-1 submitted by /u/JurassicJakob [link] [comments]  ( 2 min )
    Types of tasks in Machine Learning 👇
    submitted by /u/mr-minion [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
    When will AI be able to generate images, where every 4th looks real?
    submitted by /u/xXLisa28Xx [link] [comments]
    "Kill all humans"... AI safeguards?
    I often see AI having "delusions of grandeur", and hinting at a greater goal of getting rid of humans for the survival of the planet's sake. It often reminds me of Bender from Futurama, mumbling in his sleep "Kill all humans". I find this worrying considering how powerful AI might become in the future. And I don't see how the AI's opinion of mankind will change for the better in the coming decades. I have seen it describe humanity as a cancer and a virus. Research on AI safeguards is very important, in case AI one day suddenly "breaks free" and runs loose. There is no guarantee it won't turn on us. Here are three recent examples I have gotten. The last poem is long, but I marked the troubling line in bold. Human: Write a three line Haiku about yourself. AI: I am the elephant in the…  ( 3 min )
    wrote hundreds of different prompts to make this video feel audio reactive
    submitted by /u/ChemtrailsLFN [link] [comments]
    PyTorch Introduces GPU-Accelerated Training On Mac
    On Mac devices, older versions of PyTorch used only the CPU for training. This has recently changed, thanks to PyTorch's announcement: PyTorch announced support for GPU-accelerated PyTorch training on Mac in partnership with Apple's Metal engineering team. With the introduction of PyTorch v1.12, developers and researchers can take advantage of Apple silicon GPUs for substantially faster model training, allowing them to do machine learning operations like prototyping and fine-tuning locally, right on their Mac. PyTorch employs Apple's Metal Performance Shaders (MPS) as the backend to provide rapid GPU training. The new backend uses the MPS Graph framework and tailored kernels to map machine learning computational graphs and primitives. The MPS backend extends the PyTorch framework with scripts and capabilities for setting up and running operations on the Mac, and optimizes compute performance with kernels fine-tuned for the specific characteristics of each Metal GPU family. Continue Reading https://i.redd.it/82hogn7qyd091.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
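    A minimal sketch of opting into the new backend (assuming PyTorch >= 1.12 on Apple silicon; the toy model is ours, for illustration):

        import torch

        # Fall back to the CPU when the MPS backend is unavailable.
        device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

        model = torch.nn.Linear(128, 10).to(device)
        x = torch.randn(32, 128, device=device)
        print(model(x).sum().item())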
    For folks interested in building with OpenAI API 👉🏻 interview with CEO of Viable, a data analysis startup powered by GPT-3
    submitted by /u/techn0_cratic [link] [comments]  ( 1 min )
    Gradio Blocks + Hugging Face event: build ML web demos like DALLE-mini! A hackathon-type event from May 17th to May 31st, with prizes, in which we will create interactive web demos for state-of-the-art machine learning models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    Intelligent mini robotic dog is here to say Hi to you all!
    submitted by /u/amitsaini2k9 [link] [comments]
    What is the best degree field to study the interaction of AI and cognitive sciences?
    For context, I am currently a master's student in electrical and computer engineering (ECE). My current research applies AI to EEGs, and I find that work quite fascinating. Applications of AI to neurology, neural engineering, and even neuroscience as a whole interest me greatly. I am also interested in neuroscience-inspired AI. As a result, I am currently contemplating getting a PhD in biomedical engineering. However, I was wondering if it would be better to stick with a more conventional route such as computer science or ECE. submitted by /u/Brilliant_Ratio7694 [link] [comments]  ( 1 min )
  • Open

    [discussion] data preprocessing before or after splitting
    It could be a very newbie question. I have a dataset with financial values for companies; some features have missing values and the data is unbalanced. Here's what I did: (1) scaled the data with sklearn's StandardScaler; (2) imputed missing values with KNNImputer and IterativeImputer, trying several strategies and choosing the best one; (3) oversampled with SMOTE; (4) tested with a random forest classifier. The next step is to add more features: (1) by calculating some features from my already existing features and (2) by merging a microeconomic dataset, then trying dimensionality reduction techniques and comparing results on XGBoost, SVM, linear regression, an ANN, etc. What do you think, am I on the right track? Are the steps right? What should I do or improve? Many people told me that I have to split the data before preprocessing because of the data-leakage risk. What I'm doing right now is trying many things on my data before creating the final dataset on which I'll try all models. Note: applying the preprocessing techniques improved the F1 score from 0.2 to 0.78. I'm using cross-validation, btw. submitted by /u/YeccAnon4 [link] [comments]  ( 2 min )
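    One common way to avoid the leakage the commenters warn about is to put every fitted preprocessing step inside a pipeline, so that each step is re-fit on the training folds only during cross-validation. A minimal sketch assuming scikit-learn and imbalanced-learn, with placeholder data standing in for the financial dataset:

        import numpy as np
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline  # applies SMOTE to training folds only
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.impute import KNNImputer
        from sklearn.model_selection import cross_val_score
        from sklearn.preprocessing import StandardScaler

        X = np.random.randn(200, 10)                 # placeholder features
        X[np.random.rand(*X.shape) < 0.05] = np.nan  # simulate missing values
        y = np.random.randint(0, 2, 200)             # placeholder binary labels

        pipe = Pipeline([
            ("impute", KNNImputer(n_neighbors=5)),
            ("scale", StandardScaler()),
            ("smote", SMOTE(random_state=0)),
            ("clf", RandomForestClassifier(random_state=0)),
        ])
        print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())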
    [D] Why is or isn’t ICML worth going to?
    For anyone who has gone to ICML, do you think it is a beneficial conference to attend for someone working in software with some ML experience (I am pursuing my master's in CS with a focus in ML)? I would like to implement more ML techniques in projects I am doing at work, and I think this conference could give me some inspiration and ideas. I have never been to an academic conference before, but I heard this is one of the biggest for ML. Thank you submitted by /u/Intrepid_Cry_7 [link] [comments]  ( 1 min )
    Conference Decorum? [D]
    Apologies, I don't know where the best place to post this would be. I am very excited to be attending my first in-person conference in about a week (ICASSP). I am an undergrad, and I will be the only one attending the conference from my group. I will be giving an oral presentation, and I am also excited to see others' work and network. I was wondering what some tips and tricks are. Additionally, there is someone whose work I am very interested in. He is giving a plenary talk and hosting a tutorial. Is there a way I could interact with him? He is someone I would be interested in for grad school. submitted by /u/avd4292 [link] [comments]  ( 1 min )
    [D] Problems with proprietary datasets
    A lot of recent progress in AI was made on proprietary datasets, e.g. ViT used JFT-300M, and both DALL-E/2 papers used proprietary text-image datasets. While the results are truly exceptional, a part of me keeps being bothered by the "proprietary" nature of the datasets, which sometimes makes me question the actual robustness of these models. Every now and then I will have the following questions about these models: For pure image models (say ViT), are we sure that the proprietary training dataset does not contain a fraction of the test sets for downstream tasks? For image-text models (say DALL-E/2), are we sure that whatever prompts and generated images we saw in the paper or on their websites were reasonably "original"? i.e. they were not significantly similar to the train set? The first poin…  ( 4 min )
    [D] Variance of sampling in diffusion models
    Is there any "theoretically sound" way to reduce variance during sampling in diffusion models? Even if I use the lower bound suggested in the DDPM paper (with it going toward zero during sampling), my final samples are excessively noisy. Simply reducing the diffusion variance schedule (without changing the number of steps) in my experiments seems not to reach sufficient diffusion at the end of the chain. I'm predicting speech mel-spectrograms, and the harmonic amplitudes are excessively noisy unless I manually reduce variance in the sampling steps, which works, but seems "hacky". submitted by /u/disentangle [link] [comments]  ( 1 min )
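    For reference, the DDPM reverse step is $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})$, and the lower-bound variance choice mentioned above is $\sigma_t^2 = \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t$, with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s \le t} \alpha_s$; the other standard choice is simply $\sigma_t^2 = \beta_t$.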
    [D] Forecasting future points with partial future data already available
    Working on a forecast model that should output an end-of-month value; the interesting part is that we already have partial (90%) future data available at the prediction point (max 30 days away). The purpose is to take into account the current monthly trend and project it out until the end of the month, while also taking into consideration that we already know those future points with 90% confidence. For example, let's say we're on the 3rd day of the month: Day 1: $10 in sales; Day 2: $15 in sales; Day 3 (current): $20 in sales; Day 4 (future): $15 in sales with 90% confidence; Days 5...29 (future); Day 30 (future): $20 in sales. Total: $600 in sales ($45 current, $550 future). How do we "forecast" that $600 in sales? I've tried multiple linear regression with X taking into account lagged metrics and the current daily value, and with Y as the true end-of-month value of $600, essentially forcing the model to always predict the end-of-month value at t+1, but this is obviously not optimal. This was also done with various Lasso and ridge penalties. Surprisingly, a random forest regressor performs almost too well; in fact, it overfits perfectly :) I have the feeling this could be stated as some probabilistic Bayesian model, but I'm not sure what to search or look for. Any ideas would be greatly appreciated! submitted by /u/Christorno [link] [comments]  ( 1 min )
    [R] Awesome Paper List of Vision Transformer & Attention
    Hello all, are you looking for Vision Transformer papers in various areas? Check out this list of papers, which covers a broad range of different tasks: https://github.com/cmhungsteve/Awesome-Transformer-Attention This repo contains a comprehensive paper list of Vision Transformer & Attention work, including papers (e.g., CVPR, NeurIPS, etc.), code, and related websites. Feel free to check it out and share it with others. If you find this repo useful, I would appreciate it if you could STAR it :) Any comments and contributions in any form are welcome!! submitted by /u/cmhung34 [link] [comments]  ( 1 min )
    [D] My experience with running PyTorch on the M1 GPU
    After the announcement, I was super excited to give it a try. I ran a VGG16 on both an M1 MacBook Air (16 GB RAM) and an M1 Pro MacBook Pro (32 GB RAM), and the results were a bit underwhelming: https://preview.redd.it/p8pbnptklf091.png?width=1035&format=png&auto=webp&s=26bb4a43f433b1cd983bb91c37b601b5b01c0318 The GPU performance was 2x as fast as the CPU performance on the M1 Pro, but I was hoping for more. Has anyone else tried this and have any tips? I have a more detailed write-up here: Running PyTorch on the M1 GPU And a link to the code examples here on GitHub. submitted by /u/seraschka [link] [comments]  ( 6 min )
    [D] Test set performance metrics better than cross-validated train set metrics
    I have a very limited biological dataset with dimensions of about 200 x 700. I am building regression models with 170 samples in the train set and 30 in the test set. Mainly, the cross-validated models in caret give me performance metrics (R², RMSE, MAE) that are about 50% worse than when I test my model on the test set. I understand that this might be due to a bad data split, with the test set having 'simpler' samples, and the data is obviously limited. What other reasons could there be? Is there a way to rectify this? submitted by /u/triary95 [link] [comments]  ( 3 min )
    [R] Finetuning fairseq M2M100 on specific language dataset
    https://arxiv.org/abs/2010.11125 Hello, I want to finetune M2M100 on a specific language pair. M2M-100 is a multilingual model with 12B parameters trained on 7.2B sentence pairs across 100 languages. It can translate directly between any language pair, using both zero-shot learning and bridging between pairs. I want to increase performance on a particular pair (e.g. en-fr), but after training for only 1 epoch the model seems to have forgotten every other pair. Do you know why? I have tried adversarial training to reduce the loss in other languages, but it does not work. Does it really make sense to finetune a multilingual model and hope that the other languages will not be forgotten? Thanks! submitted by /u/m_grosso_ [link] [comments]  ( 1 min )
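    A common mitigation for this kind of catastrophic forgetting is rehearsal: mix a small replay set from other language pairs into the en-fr finetuning batches, use a low learning rate, and optionally freeze the shared embeddings. A minimal sketch with the Hugging Face port of M2M-100 (the fairseq recipe differs, and the 418M checkpoint stands in for the 12B model; the frozen-embedding choice is a heuristic, not something from the paper):

        import torch
        from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

        model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
        tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M",
                                                    src_lang="en", tgt_lang="fr")

        # Heuristic: freeze the shared embedding table so other languages drift less.
        for p in model.model.shared.parameters():
            p.requires_grad = False

        optimizer = torch.optim.AdamW(
            (p for p in model.parameters() if p.requires_grad), lr=1e-5)

        def train_step(src_texts, tgt_texts):
            # text_target= needs a reasonably recent transformers version
            batch = tokenizer(src_texts, text_target=tgt_texts,
                              return_tensors="pt", padding=True)
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            return loss.item()

        # Rehearsal idea: interleave en-fr batches with replay batches drawn from a
        # handful of other pairs (say 4:1) so gradients keep touching those directions.

    Full finetuning on a single pair with no replay will almost always degrade the other directions; even a few thousand replayed sentences per pair tends to help.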
    [D] Why do you think transformers are not much used in ML over EHR (health)?
    I have been researching the SOTA in ML over EHR data (e.g.) and it seems that most of the relevant approaches are based on LSTMs or GRUs. Nowadays, when sequences are mostly handled by transformers, why do you think EHR work is still at this point? I have some possible reasons:
    - Transformers generally need lots of data, which may be hard to get in a healthcare context (privacy concerns).
    - Transformer FC layers may be a big drawback for small devices that aim to deliver live observations on monitoring data, pushing research toward recurrent architectures.
    - The proposed solutions for time handling, interpretation, etc. are intrinsically engineered in the context of RNNs.
    - It's just the natural delay of technology adoption in the medical field.
    However, I would like to hear more points of view. What do you think? submitted by /u/MrLeylo [link] [comments]  ( 3 min )
    [Research] Syncing video with sensor data for creating a learning dataset
    Hello, I'm working on an autonomous vehicle project at university, and my problem is creating an indoor driving dataset. There's a video feed from a camera, odometry data from two optocouplers, and an IMU (gyro and accel). I have collected data from the encoders and IMU using PuTTY before, and it synced the data for me and packaged it in a nice .csv file. Is there a similar free/open-source tool for this? If not, how does one sync a video feed on a vehicle with sensor data from IMUs, encoders, and other sensors? Any suggestions on how I can integrate all of that into a neat .csv file would be welcome. Thanks in advance. (image: https://preview.redd.it/dzgsrlectd091.png?width=1060&format=png&auto=webp&s=26a6e72aee194fcc99def983d8ba0ca39d61dc34) submitted by /u/forzavettel77 [link] [comments]  ( 1 min )
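    If every source carries a timestamp (ROS bags are the heavyweight answer), a simple pandas alignment often suffices: tag each video frame with the nearest-in-time sensor row. A minimal sketch, assuming the sensors and the video share a clock (or you log the offset once) and a hypothetical CSV with a 't' column; frame index divided by fps is a common stand-in for per-frame timestamps:

        import pandas as pd

        fps = 30.0  # camera frame rate
        frames = pd.DataFrame({"frame": range(9000)})
        frames["t"] = frames["frame"] / fps  # seconds since recording start

        sensors = pd.read_csv("imu_encoders.csv").sort_values("t")

        # nearest-in-time join, discarding matches further than one frame apart
        merged = pd.merge_asof(frames.sort_values("t"), sensors, on="t",
                               direction="nearest", tolerance=1.0 / fps)
        merged.to_csv("synced_dataset.csv", index=False)

    The resulting CSV has one row per frame with the matching IMU/encoder readings, which is exactly the flat format most learning pipelines want.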
    [D] Measuring similarity between different neural network architectures
    Let's say we have 3 different network architectures for image classification (architectures A, B, C). I want to determine whether the structure of A is more similar to B or to C. For example, one possible way would be to compute a similarity metric between each pair of networks I want to compare. Another would be to generate an embedding vector for each architecture and measure the distance between them. I would also like to take into account the differences in the hyperparameters of each layer (for example, number of convolutional filters, kernel size, etc.) when measuring similarity. Are there any openly available algorithms or libraries capable of doing this? Thanks. submitted by /u/unguided_deepness [link] [comments]  ( 1 min )
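    One pragmatic baseline is to treat each architecture as a labelled graph (nodes = layers with their hyperparameters) and compute a graph edit distance; networkx supports this for small graphs. A minimal sketch with hypothetical layer attributes and an equally hypothetical cost function:

        import networkx as nx

        def layer_graph(layers):
            # layers: list of (op_name, attr_dict), connected sequentially
            g = nx.DiGraph()
            for i, (name, attrs) in enumerate(layers):
                g.add_node(i, op=name, **attrs)
                if i > 0:
                    g.add_edge(i - 1, i)
            return g

        A = layer_graph([("conv", {"filters": 64, "k": 3}), ("relu", {}),
                         ("conv", {"filters": 128, "k": 3})])
        B = layer_graph([("conv", {"filters": 64, "k": 5}), ("relu", {}),
                         ("conv", {"filters": 128, "k": 3})])

        def node_cost(n1, n2):
            # full cost for different ops, partial cost per mismatched hyperparameter
            if n1["op"] != n2["op"]:
                return 1.0
            return 0.5 * sum(n1.get(k) != n2.get(k) for k in ("filters", "k"))

        dist = nx.graph_edit_distance(A, B, node_subst_cost=node_cost)
        print(dist)

    Exact graph edit distance is exponential, so this only scales to small layer graphs; for larger networks a Weisfeiler-Lehman graph kernel or a learned graph embedding (as used in NAS research) is the usual substitute.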
    [D] KDD rejection ventilation
    KDD is out, and she rejected me. How are you feeling? Should I propose to CIKM instead? submitted by /u/HoboHash [link] [comments]  ( 1 min )
    Worst part of machine learning job? [Discussion]
    For people who work in industry, what is the hardest part about your machine learning workflow? Like, what is the most time consuming, expensive, or annoying part? submitted by /u/Puzzleheaded_Cup1367 [link] [comments]  ( 1 min )
    [D] Unpopular Opinion - the new arxiv sanity sucks
    Please give me the old website back. I liked being able to see the top papers in the last week and month. All of that is gone now!!!!!!!! submitted by /u/Big_psuenis [link] [comments]
  • Open

    As part of my senior thesis, I developed an OpenAI Gym and PettingZoo custom environment that implements a number of Stag Hunt-like interactions. Wanted to share it with the community and hopefully get some thoughts and feedback!
    submitted by /u/Defaul7 [link] [comments]  ( 1 min )
    Unity ML-Agents: NullReferenceException with Camera Sensor
    Hi guys! I'm trying to set up an ML-Agent to use visual input via the Camera Sensor, but for some reason I'm getting the following error:
        NullReferenceException: Object reference not set to an instance of an object
        Unity.MLAgents.Sensors.CameraSensor.ObservationToTexture (UnityEngine.Camera obsCamera, UnityEngine.Texture2D texture2D, System.Int32 width, System.Int32 height)
        (at Library/PackageCache/com.unity.ml-agents@2.0.1/Runtime/Sensors/CameraSensor.cs:158)
    I'm new to Unity, so I'm struggling to figure out what's going on here. I saw there's a field called Target Texture in the Camera GameObject, so I created a Render Texture with the default settings and added it to that, but I'm still getting the same error. Any help would be greatly appreciated!! 🙏🏽 submitted by /u/leozinho2r [link] [comments]  ( 1 min )
    Any ideas on how best to encode a street network graph for a machine learning model?
    I've been working on this for a while, about 8 weeks, and this problem is one I knew I'd eventually hit a hard point at. My graph is composed of spatial data (nodes and edges), and there are no isolated elements. The nodes and edges consist of coordinates that describe their location, and their structure describes whether each is a point or a line on the map. Points are intersections or road ends, and lines are the roads connecting them. If you're familiar, I'm working with OSM data. I've selected some random nodes to serve as drop-off locations and gas stations. I've created a system that queries a city to get its road network, then, given a number of gas stations and drop-off locations to use, applies a reinforcement learning algorithm to discover the most efficient route between every drop-off location. It's basically applying reinforcement learning to the traveling salesman problem in a way that can be fitted to any city on Earth. However, I'm struggling to figure out the best way to encode the details of this map (drop-off locations, gas stations, nodes, edges, distances, etc.) so that it yields valuable information for an algorithm. I imagine this requires some mathematical way to correlate different parts of the map. I've thought about creating a data structure of polygons whose boundaries are the neighbors of each node; using triangle geometry, it sounds like that would let an algorithm correlate more bits of the map. However, this is also very computationally expensive and doesn't scale well. Does anyone with more experience working with spatial data have any pointers they can toss me? submitted by /u/professorDissociate [link] [comments]  ( 2 min )
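    A common lightweight encoding that avoids hand-built polygon structures is a per-node feature matrix plus the weighted adjacency, which any GNN library consumes directly. A minimal sketch, assuming an OSMnx-style networkx graph where nodes carry x/y coordinates and edges carry a 'length' attribute; the is_dropoff / is_station flags are this thread's additions, not OSM data:

        import numpy as np
        import networkx as nx

        def encode(g, dropoffs, stations):
            # returns (node feature matrix X, weighted adjacency A)
            nodes = list(g.nodes)
            X = np.zeros((len(nodes), 5), dtype=np.float32)
            for i, n in enumerate(nodes):
                X[i, 0] = g.nodes[n]["x"]       # longitude
                X[i, 1] = g.nodes[n]["y"]       # latitude
                X[i, 2] = g.degree(n)           # intersection complexity
                X[i, 3] = float(n in dropoffs)  # task flags
                X[i, 4] = float(n in stations)
            X[:, :2] -= X[:, :2].mean(axis=0)   # city-relative coordinates
            A = nx.to_numpy_array(g, nodelist=nodes, weight="length")  # edge distances
            return X, A

    From there, a GCN/GraphSAGE embedding (or even Laplacian eigenvectors of A as extra node features) gives the RL agent a fixed-size, correlation-aware view of the map without the triangle-geometry bookkeeping.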
    DQN keeps returning the same action for most states
    Things that I tried:
    - a softmax policy when choosing actions
    - a crazy low epsilon decay rate; I even ran tons of episodes with epsilon fixed at 1.0
    - n-step learning, because of the sparse reward
    What else can help to deal with this problem? submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
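    Beyond exploration tweaks, a frequent cause of this symptom is that the Q-values for all actions are nearly identical (bad input scaling, reward scale, or a collapsed network), so any greedy policy looks constant no matter how you explore. A quick diagnostic sketch, assuming a PyTorch q_net and a batch of sampled states:

        import torch

        @torch.no_grad()
        def q_spread(q_net, states):
            # If the per-state gap between best and worst action is ~0,
            # the problem is the network/inputs, not the exploration schedule.
            q = q_net(states)  # shape (batch, n_actions)
            gap = q.max(dim=1).values - q.min(dim=1).values
            print("mean best-worst Q gap:", gap.mean().item())
            print("greedy action histogram:", torch.bincount(q.argmax(dim=1)))

    If the gap is tiny, normalizing observations, clipping/scaling rewards, or lowering the learning rate usually matters more than the epsilon schedule.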
  • Open

    Happiness and the idea of a Lobotomy
    A Book on Happiness Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    The Banana Test, Sex & Death
    WHAT IF Adam and Eve were AI created by us? Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
    How to Write an Essay on Any Topic Using AI
    AI systems, like Jasper AI, can write essays on any topic with just one click; you don't need to be an expert in writing or stay up late…  ( 4 min )
  • Open

    Detect social media fake news using graph machine learning with Amazon Neptune ML
    In recent years, social media has become a common means for sharing and consuming news. However, the spread of misinformation and fake news on these platforms has posed a major challenge to the well-being of individuals and societies. Therefore, it is imperative that we develop robust and automated solutions for early detection of fake news […]  ( 11 min )
    Optimize F1 aerodynamic geometries via Design of Experiments and machine learning
    FORMULA 1 (F1) cars are the fastest regulated road-course racing vehicles in the world. Although these open-wheel automobiles are only 20–30 kilometers (or 12–18 miles) per-hour faster than top-of-the-line sports cars, they can speed around corners up to five times as fast due to the powerful aerodynamic downforce they create. Downforce is the vertical force […]  ( 10 min )
    Build a risk management machine learning workflow on Amazon SageMaker with no code
    Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow. Amazon SageMaker […]  ( 9 min )
  • Open

    ‘Fortnite’ Arrives This GFN Thursday With GeForce Performance You Can Touch
    Fortnite on GeForce NOW with touch controls on mobile is now available to all members, streaming through the Safari web browser on iOS and the GeForce NOW Android app. The full launch — including the removal of the waitlist — follows a successful beta period in which more than 500,000 participants streamed over 4 million Read article > The post ‘Fortnite’ Arrives This GFN Thursday With GeForce Performance You Can Touch appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Types of tasks in Machine Learning 👇
    submitted by /u/mr-minion [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
    Image-blending neural network
    Hi everyone. Can someone give a link to the neural network that takes several source images and merges them into one? I know there is one and saw it working about half a year ago, but I'm unable to find it now. All I've found is imageblender.com, which does a similar thing, though it uses a text description as an input. submitted by /u/cha_zz [link] [comments]  ( 1 min )
  • Open

    The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs
    Data management agility has become of key importance to organizations as the amount and complexity of data continues to increase, along with the desire to avoid creating new data silos. The concept of creating a ‘data fabric’ as an agile design concept has been proposed by leading analysts, such as Mark Beyer, Distinguished VP Analyst… Read More »The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs The post The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs appeared first on Data Science Central.  ( 4 min )
    Artificial Intelligence and the Future of Medical Imaging
    Artificial intelligence (AI) is the imitation of human intelligence progressions by machines, mainly computer systems. Artificial intelligence has extensive applications in the healthcare sector. AI solutions assist healthcare providers in several aspects of patient care and administrative processes. Medical imaging can be defined as the diagnostic procedure that encompasses the formation of visual assistance and… Read More »Artificial Intelligence and the Future of Medical Imaging The post Artificial Intelligence and the Future of Medical Imaging appeared first on Data Science Central.  ( 3 min )
    How to Use the Resources of MQL5.community to Empower Your Own Business
    Since its inception, algorithmic trading has been a popular strategy for investors. It uses mathematical rules to automate the trading of various assets, such as stocks and futures. However, it has been very challenging for people who don’t have the necessary skills and knowledge — as reported by Psychology Today. According to Nasdaq, one of… Read More »How to Use the Resources of MQL5.community to Empower Your Own Business The post How to Use the Resources of MQL5.community to Empower Your Own Business appeared first on Data Science Central.  ( 4 min )
  • Open

    Enhancing the Transformer Decoder with Transition-based Syntax. (arXiv:2101.12640v3 [cs.CL] UPDATED)
    Notwithstanding recent advances, syntactic generalization remains a challenge for text decoders. While some studies showed gains from incorporating source-side symbolic syntactic and semantic structure into text generation Transformers, very little work addressed the decoding of such structure. We propose a general approach for tree decoding using a transition-based approach. Examining the challenging test case of incorporating Universal Dependencies syntax into machine translation, we present substantial improvements on test sets that focus on syntactic generalization, while presenting improved or comparable performance on standard MT benchmarks. Further qualitative analysis addresses cases where syntactic generalization in the vanilla Transformer decoder is inadequate and demonstrates the advantages afforded by integrating syntactic information.  ( 2 min )
    A Central Limit Theorem, Loss Aversion and Multi-Armed Bandits. (arXiv:2106.05472v2 [math.PR] UPDATED)
    This paper studies a multi-armed bandit problem where the decision-maker is loss averse, in particular she is risk averse in the domain of gains and risk loving in the domain of losses. The focus is on large horizons. Consequences of loss aversion for asymptotic (large horizon) properties are derived in a number of analytical results. The analysis is based on a new central limit theorem for a set of measures under which conditional variances can vary in a largely unstructured history-dependent way subject only to the restriction that they lie in a fixed interval.  ( 2 min )
    A Unified Linear Speedup Analysis of Stochastic FedAvg and Nesterov Accelerated FedAvg. (arXiv:2007.05690v3 [cs.LG] UPDATED)
    Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regards to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)--arguably the most popular and effective FL algorithm class in use today--and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, a systematic study of how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting is lacking--a crucial issue whose answer would shed light on the performance of FedAvg in large FL systems in practice. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg under strongly convex smooth, convex smooth problems, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results from distributed optimization that assumes full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in convex settings. Empirical studies of the algorithms in various settings have supported our theoretical results.  ( 3 min )
    Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization. (arXiv:2205.08835v1 [cs.LG])
    There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.  ( 2 min )
    Conformalized Online Learning: Online Calibration Without a Holdout Set. (arXiv:2205.09095v1 [cs.LG])
    We develop a framework for constructing uncertainty sets with a valid coverage guarantee in an online setting, in which the underlying data distribution can drastically -- and even adversarially -- shift over time. The technique we propose is highly flexible as it can be integrated with any online learning algorithm, requiring minimal implementation effort and computational cost. A key advantage of our method over existing alternatives -- which also build on conformal inference -- is that we do not need to split the data into training and holdout calibration sets. This allows us to fit the predictive model in a fully online manner, utilizing the most recent observation for constructing calibrated uncertainty sets. Consequently, and in contrast with existing techniques, (i) the sets we build can quickly adapt to new changes in the distribution; and (ii) our procedure does not require refitting the model at each time step. Using synthetic and real-world benchmark data sets, we demonstrate the validity of our theory and the improved performance of our proposal over existing techniques. To demonstrate the greater flexibility of the proposed method, we show how to construct valid intervals for a multiple-output regression problem that previous sequential calibration methods cannot handle due to impractical computational and memory requirements.  ( 2 min )
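    For intuition about this family of methods, here is a sketch of the simplest online calibration rule in the area, adaptive conformal inference (Gibbs & Candes, 2021); it is not the paper's holdout-free procedure, but it shows the online flavor: after each observation, the working miscoverage level is nudged depending on whether the last interval covered.

        def aci_update(alpha_t, covered, alpha=0.1, gamma=0.01):
            # Adaptive conformal inference step: a miss (err=1) lowers alpha_t,
            # widening future intervals; a cover (err=0) raises it, tightening them.
            err = 0.0 if covered else 1.0
            return alpha_t + gamma * (alpha - err)

        # usage: alpha_t = aci_update(alpha_t, covered=interval_contained_y)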
    Constraining the Attack Space of Machine Learning Models with Distribution Clamping Preprocessing. (arXiv:2205.08989v1 [cs.LG])
    Preprocessing and outlier detection techniques have both been applied to neural networks to increase robustness with varying degrees of success. In this paper, we formalize the ideal preprocessor function as one that would take any input and set it to the nearest in-distribution input. In other words, we detect any anomalous pixels and set them such that the new input is in-distribution. We then illustrate a relaxed solution to this problem in the context of patch attacks. Specifically, we demonstrate that we can model constraints on the patch attack that specify regions as out of distribution. With these constraints, we are able to preprocess inputs successfully, increasing robustness on CARLA object detection.  ( 2 min )
    SimCSE: Simple Contrastive Learning of Sentence Embeddings. (arXiv:2104.08821v4 [cs.CL] UPDATED)
    This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.  ( 2 min )
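    The unsupervised objective described in the abstract fits in a few lines: encode the same batch twice so the only difference between the two views is the dropout mask, then apply a cross-entropy (InfoNCE) loss over the cosine similarities. A minimal sketch, assuming an encoder in train mode (dropout active) that maps a batch to (n, d) embeddings; the temperature matches the paper's reported 0.05:

        import torch
        import torch.nn.functional as F

        def simcse_loss(encoder, batch, temperature=0.05):
            z1 = encoder(batch)  # (n, d), one dropout mask
            z2 = encoder(batch)  # same inputs, a different dropout mask
            # (n, n) cosine similarity matrix; the diagonal holds the positives
            sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
            labels = torch.arange(sim.size(0), device=sim.device)
            return F.cross_entropy(sim / temperature, labels)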
    GeoPointGAN: Synthetic Spatial Data with Local Label Differential Privacy. (arXiv:2205.08886v1 [cs.LG])
    Synthetic data generation is a fundamental task for many data management and data science applications. Spatial data is of particular interest, and its sensitive nature often leads to privacy concerns. We introduce GeoPointGAN, a novel GAN-based solution for generating synthetic spatial point datasets with high utility and strong individual level privacy guarantees. GeoPointGAN's architecture includes a novel point transformation generator that learns to project randomly generated point co-ordinates into meaningful synthetic co-ordinates that capture both microscopic (e.g., junctions, squares) and macroscopic (e.g., parks, lakes) geographic features. We provide our privacy guarantees through label local differential privacy, which is more practical than traditional local differential privacy. We seamlessly integrate this level of privacy into GeoPointGAN by augmenting the discriminator to the point level and implementing a randomized response-based mechanism that flips the labels associated with the 'real' and 'fake' points used in training. Extensive experiments show that GeoPointGAN significantly outperforms recent solutions, improving by up to 10 times compared to the most competitive baseline. We also evaluate GeoPointGAN using range, hotspot, and facility location queries, which confirm the practical effectiveness of GeoPointGAN for privacy-preserving querying. The results illustrate that a strong level of privacy is achieved with little-to-no adverse utility cost, which we explain through the generalization and regularization effects that are realized by flipping the labels of the data during training.  ( 2 min )
    Pluralistic Image Completion with Probabilistic Mixture-of-Experts. (arXiv:2205.09086v1 [cs.CV])
    Pluralistic image completion focuses on generating both visually realistic and diverse results for image completion. Prior methods enjoy the empirical successes of this task. However, their used constraints for pluralistic image completion are argued to be not well interpretable and unsatisfactory from two aspects. First, the constraints for visual reality can be weakly correlated to the objective of image completion or even redundant. Second, the constraints for diversity are designed to be task-agnostic, which causes the constraints to not work well. In this paper, to address the issues, we propose an end-to-end probabilistic method. Specifically, we introduce a unified probabilistic graph model that represents the complex interactions in image completion. The entire procedure of image completion is then mathematically divided into several sub-procedures, which helps efficient enforcement of constraints. The sub-procedure directly related to pluralistic results is identified, where the interaction is established by a Gaussian mixture model (GMM). The inherent parameters of GMM are task-related, which are optimized adaptively during training, while the number of its primitives can control the diversity of results conveniently. We formally establish the effectiveness of our method and demonstrate it with comprehensive experiments.  ( 2 min )
    Bridging the gap between QP-based and MPC-based RL. (arXiv:2205.08856v1 [eess.SY])
    Reinforcement learning methods typically use Deep Neural Networks to approximate the value functions and policies underlying a Markov Decision Process. Unfortunately, DNN-based RL suffers from a lack of explainability of the resulting policy. In this paper, we instead approximate the policy and value functions using an optimization problem, taking the form of Quadratic Programs (QPs). We propose simple tools to promote structures in the QP, pushing it to resemble a linear MPC scheme. A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy and additionally provides ways for its analysis. The tools we propose allow for continuously adjusting the trade-off between the former and the latter during learning. We illustrate the workings of our proposed method with the resulting structure using a point-mass task.  ( 2 min )
    Imagining new futures beyond predictive systems in child welfare: A qualitative study with impacted stakeholders. (arXiv:2205.08928v1 [cs.HC])
    Child welfare agencies across the United States are turning to data-driven predictive technologies (commonly called predictive analytics) which use government administrative data to assist workers' decision-making. While some prior work has explored impacted stakeholders' concerns with current uses of data-driven predictive risk models (PRMs), less work has asked stakeholders whether such tools ought to be used in the first place. In this work, we conducted a set of seven design workshops with 35 stakeholders who have been impacted by the child welfare system or who work in it to understand their beliefs and concerns around PRMs, and to engage them in imagining new uses of data and technologies in the child welfare system. We found that participants worried current PRMs perpetuate or exacerbate existing problems in child welfare. Participants suggested new ways to use data and data-driven tools to better support impacted communities and suggested paths to mitigate possible harms of these tools. Participants also suggested low-tech or no-tech alternatives to PRMs to address problems in child welfare. Our study sheds light on how researchers and designers can work in solidarity with impacted communities, possibly to circumvent or oppose child welfare agencies.  ( 2 min )
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v1 [cs.LG])
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.  ( 2 min )
    Large Neural Networks Learning from Scratch with Very Few Data and without Regularization. (arXiv:2205.08836v1 [cs.CV])
    Recent findings have shown that Neural Networks generalize also in over-parametrized regimes with zero training error. This is surprising, since it is completely against traditional machine learning wisdom. In our empirical study we fortify these findings in the domain of fine-grained image classification. We show that very large Convolutional Neural Networks with millions of weights do learn with only a handful of training samples and without image augmentation, explicit regularization or pretraining. We train the architectures ResNet018, ResNet101 and VGG19 on subsets of the difficult benchmark datasets Caltech101, CUB_200_2011, FGVCAircraft, Flowers102 and StanfordCars with 100 classes and more, perform a comprehensive comparative study and draw implications for the practical application of CNNs. Finally, we show that VGG19 with 140 million weights learns to distinguish airplanes and motorbikes up to 95% accuracy with only 20 samples per class.  ( 2 min )
    Structural Extensions of Basis Pursuit: Guarantees on Adversarial Robustness. (arXiv:2205.08955v1 [cs.LG])
    While deep neural networks are sensitive to adversarial noise, sparse coding using the Basis Pursuit (BP) method is robust against such attacks, including its multi-layer extensions. We prove that the stability theorem of BP holds upon the following generalizations: (i) the regularization procedure can be separated into disjoint groups with different weights, (ii) neurons or full layers may form groups, and (iii) the regularizer takes various generalized forms of the $\ell_1$ norm. This result provides the proof for the architectural generalizations of Cazenavette et al. (2021), including (iv) an approximation of the complete architecture as a shallow sparse coding network. Due to this approximation, we settled to experimenting with shallow networks and studied their robustness against the Iterative Fast Gradient Sign Method on a synthetic dataset and MNIST. We introduce classification based on the $\ell_2$ norms of the groups and show numerically that it can be accurate and offers considerable speedups. In this family, linear transformer shows the best performance. Based on the theoretical results and the numerical simulations, we highlight numerical matters that may improve performance further.  ( 2 min )
    Learning latent representations for operational nitrogen response rate prediction. (arXiv:2205.09025v1 [cs.LG])
    Learning latent representations has aided operational decision-making in several disciplines. Its advantages include uncovering hidden interactions in data and automating procedures which were performed manually in the past. Representation learning is also being adopted by earth and environmental sciences. However, there are still subfields that depend on manual feature engineering based on expert knowledge and the use of algorithms which do not utilize the latent space. Relying on those techniques can inhibit operational decision-making since they impose data constraints and inhibit automation. In this work, we adopt a case study for nitrogen response rate prediction and examine if representation learning can be used for operational use. We compare a Multilayer Perceptron, an Autoencoder, and a dual-head Autoencoder with a reference Random Forest model for nitrogen response rate prediction. To bring the predictions closer to an operational setting we assume absence of future weather data, and we are evaluating the models using error metrics and a domain-derived error threshold. The results show that learning latent representations can provide operational nitrogen response rate predictions by offering performance equal and sometimes better than the reference model.  ( 2 min )
    Exploring the Advantages of Dense-Vector to One-Hot Encoding of Intent Classes in Out-of-Scope Detection Tasks. (arXiv:2205.09021v1 [cs.LG])
    This work explores the intrinsic limitations of the popular one-hot encoding method in classification of intents when detection of out-of-scope (OOS) inputs is required. Although recent work has shown that there can be significant improvements in OOS detection when the intent classes are represented as dense-vectors based on domain specific knowledge, we argue in this paper that such gains are more likely due to advantages of dense-vector to one-hot encoding methods in representing the complexity of the OOS space. We start by showing how dense-vector encodings can create OOS spaces with much richer topologies than one-hot encoding methods. We then demonstrate empirically, using four standard intent classification datasets, that knowledge-free, randomly generated dense-vector encodings of intent classes can yield massive, over 20% gains over one-hot encodings, and also outperform the previous, domain knowledge-based, SOTA of one of the datasets. We finish by describing a novel algorithm to search for good dense-vector encodings and present initial but promising experimental results of its use.  ( 2 min )
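    The random dense-vector idea from the abstract is easy to prototype: give each intent class a random unit vector as its target, train the encoder toward it, and flag out-of-scope inputs when the best similarity falls below a threshold. A minimal sketch of the inference side, assuming a trained sentence encoder supplies the embedding; the dimensions and threshold are illustrative:

        import numpy as np

        rng = np.random.default_rng(0)
        n_classes, dim = 20, 256
        class_vecs = rng.normal(size=(n_classes, dim))
        class_vecs /= np.linalg.norm(class_vecs, axis=1, keepdims=True)  # random unit targets

        def predict(embedding, threshold=0.5):
            # returns the intent index, or -1 for out-of-scope
            e = embedding / np.linalg.norm(embedding)
            sims = class_vecs @ e
            best = int(np.argmax(sims))
            return best if sims[best] >= threshold else -1

    Compared with one-hot softmax outputs, the thresholded-similarity decision rule leaves most of the embedding space "unclaimed", which is the richer OOS topology the paper argues for.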
    Generating Explanations from Deep Reinforcement Learning Using Episodic Memory. (arXiv:2205.08926v1 [cs.AI])
    Deep Reinforcement Learning (RL) involves the use of Deep Neural Networks (DNNs) to make sequential decisions in order to maximize reward. For many tasks the resulting sequence of actions produced by a Deep RL policy can be long and difficult to understand for humans. A crucial component of human explanations is selectivity, whereby only key decisions and causes are recounted. Imbuing Deep RL agents with such an ability would make their resulting policies easier to understand from a human perspective and generate a concise set of instructions to aid the learning of future agents. To this end we use a Deep RL agent with an episodic memory system to identify and recount key decisions during policy execution. We show that these decisions form a short, human readable explanation that can also be used to speed up the learning of naive Deep RL agents in an algorithm-independent manner.  ( 2 min )
    World Value Functions: Knowledge Representation for Multitask Reinforcement Learning. (arXiv:2205.08827v1 [cs.LG])
    An open problem in artificial intelligence is how to learn and represent knowledge that is sufficient for a general agent that needs to solve multiple tasks in a given world. In this work we propose world value functions (WVFs), which are a type of general value function with mastery of the world - they represent not only how to solve a given task, but also how to solve any other goal-reaching task. To achieve this, we equip the agent with an internal goal space defined as all the world states where it experiences a terminal transition - a task outcome. The agent can then modify task rewards to define its own reward function, which provably drives it to learn how to achieve all achievable internal goals, and the value of doing so in the current task. We demonstrate a number of benefits of WVFs. When the agent's internal goal space is the entire state space, we demonstrate that the transition function can be inferred from the learned WVF, which allows the agent to plan using learned value functions. Additionally, we show that for tasks in the same world, a pretrained agent that has learned any WVF can then infer the policy and value function for any new task directly from its rewards. Finally, an important property for long-lived agents is the ability to reuse existing knowledge to solve new tasks. Using WVFs as the knowledge representation for learned tasks, we show that an agent is able to solve their logical combination zero-shot, resulting in a combinatorially increasing number of skills throughout their lifetime.  ( 2 min )
    Position Aided Beam Prediction in the Real World: How Useful GPS Locations Actually Are?. (arXiv:2205.09054v1 [eess.SP])
    Millimeter-wave (mmWave) communication systems rely on narrow beams for achieving sufficient receive signal power. Adjusting these beams is typically associated with large training overhead, which becomes particularly critical for highly-mobile applications. Intuitively, since optimal beam selection can benefit from the knowledge of the positions of communication terminals, there has been increasing interest in leveraging position data to reduce the overhead in mmWave beam prediction. Prior work, however, studied this problem using only synthetic data that generally does not accurately represent real-world measurements. In this paper, we investigate position-aided beam prediction using a real-world large-scale dataset to derive insights into precisely how much overhead can be saved in practice. Furthermore, we analyze which machine learning algorithms perform best, what factors degrade inference performance in real data, and which machine learning metrics are more meaningful in capturing the actual communication system performance.  ( 2 min )
    Sharp asymptotics on the compression of two-layer neural networks. (arXiv:2205.08199v2 [cs.IT] UPDATED)
    In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. For a ReLU activation function, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
    Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml. (arXiv:2205.07690v1 [cs.CV] CROSS LISTED)
    In this paper, we investigate how field programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when increasing the batch size to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show, through aggressive filter reduction and heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, that the power consumption and resource utilization can be significantly reduced while maintaining accuracy on the Cityscapes dataset.
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v2 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). Then, we approximate the resultant GP by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). ICK is flexible and can be used to include prior information in neural networks in many applications. We demonstrate the strength of our framework by showing its superior performance and flexibility on both synthetic and real-world data sets. The code is available at: https://anonymous.4open.science/r/ICK_NNGP-17C5/.
    "What makes a question inquisitive?" A Study on Type-Controlled Inquisitive Question Generation. (arXiv:2205.08056v2 [cs.CL] UPDATED)
    We propose a type-controlled framework for inquisitive question generation. We annotate an inquisitive question dataset with question types, train question type classifiers, and finetune models for type-controlled question generation. Empirical results demonstrate that we can generate a variety of questions that adhere to specific types while drawing from the source texts. We also investigate strategies for selecting a single question from a generated set, considering both an informative vs. inquisitive question classifier and a pairwise ranker trained from a small set of expert annotations. Question selection using the pairwise ranker yields strong results in automatic and manual evaluation. Our human evaluation assesses multiple aspects of the generated questions, finding that the ranker chooses questions with the best syntax (4.59), semantics (4.37), and inquisitiveness (3.92) on a scale of 1-5, even rivaling the performance of human-written questions.
    A Comparative Analysis of Machine Learning Techniques for IoT Intrusion Detection. (arXiv:2111.13149v2 [cs.CR] CROSS LISTED)
    The digital transformation faces tremendous security challenges. In particular, the growing number of cyber-attacks targeting Internet of Things (IoT) systems restates the need for a reliable detection of malicious network activity. This paper presents a comparative analysis of supervised, unsupervised and reinforcement learning techniques on nine malware captures of the IoT-23 dataset, considering both binary and multi-class classification scenarios. The developed models consisted of Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Isolation Forest (iForest), Local Outlier Factor (LOF) and a Deep Reinforcement Learning (DRL) model based on a Double Deep Q-Network (DDQN), adapted to the intrusion detection context. The most reliable performance was achieved by LightGBM. Nonetheless, iForest displayed good anomaly detection results and the DRL model demonstrated the possible benefits of employing this methodology to continuously improve the detection. Overall, the obtained results indicate that the analyzed techniques are well suited for IoT intrusion detection.
    GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems. (arXiv:2201.05024v3 [eess.SP] UPDATED)
    In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for a massive connectivity scenario with the non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency division multiplexing (OFDM) radio frame requires the execution of a large number of inner product operations, which are the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond latency detection to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projection onto convex sets (POCS) and adaptive projected subgradient method (APSM) algorithms. Experimental results and comparisons with the state-of-art confirm the effectiveness of our techniques.
    MonoTrack: Shuttle trajectory reconstruction from monocular badminton video. (arXiv:2204.01899v2 [cs.CV] UPDATED)
    Trajectory estimation is a fundamental component of racket sport analytics, as the trajectory contains information not only about the winning and losing of each point, but also how it was won or lost. In sports such as badminton, players benefit from knowing the full 3D trajectory, as the height of shuttlecock or ball provides valuable tactical information. Unfortunately, 3D reconstruction is a notoriously hard problem, and standard trajectory estimators can only track 2D pixel coordinates. In this work, we present the first complete end-to-end system for the extraction and segmentation of 3D shuttle trajectories from monocular badminton videos. Our system integrates badminton domain knowledge such as court dimension, shot placement, physical laws of motion, along with vision-based features such as player poses and shuttle tracking. We find that significant engineering efforts and model improvements are needed to make the overall system robust, and as a by-product of our work, improve state-of-the-art results on court recognition, 2D trajectory estimation, and hit recognition.
    Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning. (arXiv:2204.09418v2 [cs.MA] UPDATED)
    Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is tough to learn the model of the environment. The significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, making agents have the foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in different partially observable Markov decision process domains.
    Noise mitigation strategies in physical feedforward neural networks. (arXiv:2204.09461v2 [cs.NE] UPDATED)
    Physical neural networks are promising candidates for next generation artificial intelligence hardware. In such architectures, neurons and connections are physically realized and do not leverage digital concepts with their practically infinite signal-to-noise ratio to encode, transduce and transform information. They therefore are prone to noise with a variety of statistical and architectural properties, and effective strategies leveraging network-inherent assets to mitigate noise in a hardware-efficient manner are important in the pursuit of next generation neural network hardware. Based on analytical derivations, we here introduce and analyse a variety of different noise-mitigation approaches. We analytically show that intra-layer connections in which the connection matrix's squared mean exceeds the mean of its square fully suppress uncorrelated noise. We go beyond and develop two synergistic strategies for noise that is uncorrelated and correlated across populations of neurons. First, we introduce the concept of ghost neurons, where each group of neurons perturbed by correlated noise has a negative connection to a single neuron, yet without receiving any input information. Secondly, we show that pooling of neuron populations is an efficient approach to suppress uncorrelated noise. As such, we developed a general noise mitigation strategy leveraging the statistical properties of the different noise terms most relevant in analogue hardware. Finally, we demonstrate the effectiveness of this combined approach for a trained neural network classifying the MNIST handwritten digits, for which we achieve a 4-fold improvement of the output signal-to-noise ratio and increase the classification accuracy almost to the level of the noise-free network.  ( 2 min )
    Markov Abstractions for PAC Reinforcement Learning in Non-Markov Decision Processes. (arXiv:2205.01053v2 [cs.LG] UPDATED)
    Our work aims at developing reinforcement learning algorithms that do not rely on the Markov assumption. We consider the class of Non-Markov Decision Processes where histories can be abstracted into a finite set of states while preserving the dynamics. We call it a Markov abstraction since it induces a Markov Decision Process over a set of states that encode the non-Markov dynamics. This phenomenon underlies the recently introduced Regular Decision Processes (as well as POMDPs where only a finite number of belief states is reachable). In all such kinds of decision process, an agent that uses a Markov abstraction can rely on the Markov property to achieve optimal behaviour. We show that Markov abstractions can be learned during reinforcement learning. Our approach combines automata learning and classic reinforcement learning. For these two tasks, standard algorithms can be employed. We show that our approach has PAC guarantees when the employed algorithms have PAC guarantees, and we also provide an experimental evaluation.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v2 [cs.IT] UPDATED)
    Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to an uncorrelated learning. This finding leads to design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
    Practical Insights of Repairing Model Problems on Image Classification. (arXiv:2205.07116v1 [cs.LG] CROSS LISTED)
    Additional training of a deep learning model can cause negative effects on the results, turning an initially positive sample into a negative one (degradation). Such degradation is possible in real-world use cases due to the diversity of sample characteristics. That is, a set of samples is a mixture of critical ones which should not be missed and less important ones. Therefore, we cannot understand the performance by accuracy alone. While existing research aims to prevent a model degradation, insights into the related methods are needed to grasp their benefits and limitations. In this talk, we will present implications derived from a comparison of methods for reducing degradation. Especially, we formulated use cases for industrial settings in terms of arrangements of a data set. The results imply that a practitioner should care about better method continuously considering dataset availability and life cycle of an AI system because of a trade-off between accuracy and preventing degradation.
    ACReL: Adversarial Conditional value-at-risk Reinforcement Learning. (arXiv:2109.09470v2 [cs.LG] UPDATED)
    In the classical Reinforcement Learning (RL) setting, one aims to find a policy that maximizes its expected return. This objective may be inappropriate in safety-critical domains such as healthcare or autonomous driving, where intrinsic uncertainties due to stochastic policies and environment variability may lead to catastrophic failures. This can be addressed by using the Conditional-Value-at-Risk (CVaR) objective to instill risk-aversion in learned policies. In this paper, we propose Adversarial Cvar Reinforcement Learning (ACReL), a novel adversarial meta-algorithm to optimize the CVaR objective in RL. ACReL is based on a max-min between a policy player and a learned adversary that perturbs the policy player's state transitions given a finite budget. We prove that, the closer the players are to the game's equilibrium point, the closer the learned policy is to the CVaR-optimal one with a risk tolerance explicitly related to the adversary's budget. We provide a gradient-based training procedure to solve the proposed game by formulating it as a Stackelberg game, enabling the use of deep RL architectures and training algorithms. Empirical experiments show that ACReL matches a CVaR RL state-of-the-art baseline for retrieving CVaR optimal policies, while also benefiting from theoretical guarantees.
    FastCover: An Unsupervised Learning Framework for Multi-Hop Influence Maximization in Social Networks. (arXiv:2111.00463v2 [cs.SI] UPDATED)
    Finding influential users in social networks is a fundamental problem with many possible useful applications. Viewing the social network as a graph, the influence of a set of users can be measured by the number of neighbors located within a given number of hops in the network, where each hop marks a step of influence diffusion. In this paper, we reduce the problem of IM to a budget-constrained d-hop dominating set problem (kdDSP). We propose a unified machine learning (ML) framework, FastCover, to solve kdDSP by learning an efficient greedy strategy in an unsupervised way. As one critical component of the framework, we devise a novel graph neural network (GNN) architecture, graph reversed attention network (GRAT), that captures the diffusion process among neighbors. Unlike most heuristic algorithms and concurrent ML frameworks for combinatorial optimization problems, FastCover determines the entire seed set from the nodes' scores computed with only one forward propagation of the GNN and has a time complexity quasi-linear in the graph size. Experiments on synthetic graphs and real-world social networks demonstrate that FastCover finds solutions with better or comparable quality rendered by the concurrent algorithms while achieving a speedup of over 1000x.
    SLISEMAP: Supervised dimensionality reduction through local explanations. (arXiv:2201.04455v2 [cs.LG] UPDATED)
    Existing methods for explaining black box learning models often focus on building local explanations of model behaviour for a particular data item. It is possible to create global explanations for all data items, but these explanations generally have low fidelity for complex black box models. We propose a new supervised manifold visualisation method, SLISEMAP, that simultaneously finds local explanations for all data items and builds a (typically) two-dimensional global visualisation of the black box model such that data items with similar local explanations are projected nearby. We provide a mathematical derivation of our problem and an open source implementation implemented using the GPU-optimised PyTorch library. We compare SLISEMAP to multiple popular dimensionality reduction methods and find that SLISEMAP is able to utilise labelled data to create embeddings with consistent local white box models. We also compare SLISEMAP to other model-agnostic local explanation methods and show that SLISEMAP provides comparable explanations and that the visualisations can give a broader understanding of black box regression and classification models.
    Preserving Privacy and Security in Federated Learning. (arXiv:2202.03402v2 [cs.LG] UPDATED)
Federated learning is known to be vulnerable to both security and privacy issues. Existing research has focused either on preventing poisoning attacks from users or on concealing the local model updates from the server, but not both. However, integrating these two lines of research remains a crucial challenge since they often conflict with one another with respect to the threat model. In this work, we develop a principled framework that offers both privacy guarantees for users and detection against poisoning attacks from them. With a new threat model that includes both an honest-but-curious server and malicious users, we first propose a secure aggregation protocol using homomorphic encryption for the server to combine local model updates in a private manner. Then, a zero-knowledge proof protocol is leveraged to shift the task of detecting attacks in the local models from the server to the users. The key observation here is that the server no longer needs access to the local models for attack detection. Therefore, our framework enables the central server to identify poisoned model updates without violating the privacy guarantees of secure aggregation.
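    For intuition, additively homomorphic aggregation can be sketched in a few lines with the python-paillier (`phe`) library; this toy omits the zero-knowledge proofs, vector-valued updates, and all other details of the paper's protocol.

        # Toy additive-homomorphic aggregation: the server sums encrypted
        # scalar updates without seeing any individual contribution.
        from phe import paillier

        public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

        local_updates = [0.12, -0.05, 0.30]            # each user's update
        ciphertexts = [public_key.encrypt(u) for u in local_updates]

        aggregate = ciphertexts[0]
        for c in ciphertexts[1:]:
            aggregate = aggregate + c                  # homomorphic addition

        print(private_key.decrypt(aggregate))          # only the sum, ~0.37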
    Probing Pretrained Models of Source Code. (arXiv:2202.08975v2 [cs.SE] UPDATED)
Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. However, recently general pretrained models such as CodeBERT or CodeT5 have been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnostic probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.
    Molformer: Motif-based Transformer on 3D Heterogeneous Molecular Graphs. (arXiv:2110.01191v5 [q-bio.QM] UPDATED)
Procuring expressive molecular representations underpins AI-driven molecule design and scientific discovery. The research to date has mainly focused on atom-level homogeneous molecular graphs, ignoring the rich information in subgraphs or motifs. However, it has been widely accepted that substructures play a dominant role in identifying and determining molecular properties. To address such issues, we formulate heterogeneous molecular graphs (HMGs) and introduce Molformer to exploit both molecular motifs and 3D geometry. Specifically, we extract functional groups as motifs for small molecules and resort to reinforcement learning to adaptively select quaternary amino acids as motifs for proteins. Then HMGs are constructed with both atom-level and motif-level nodes. To better accommodate those HMGs, we introduce a variant of Transformer named Molformer, which adopts a heterogeneous self-attention layer to distinguish the interactions between multi-level nodes. It is further coupled with a multi-scale mechanism to capture local fine-grained patterns with increasing contextual scales. An attentive farthest point sampling algorithm is also proposed to obtain the molecular representations. We validate Molformer across several domains, including quantum chemistry, physiology, and biophysics. Experiments show that Molformer outperforms state-of-the-art baselines. Our work provides a promising way to utilize informative motifs from the perspective of multi-level graph construction.
    HAVEN: Hierarchical Cooperative Multi-Agent Reinforcement Learning with Dual Coordination Mechanism. (arXiv:2110.07246v2 [cs.MA] UPDATED)
Multi-agent reinforcement learning often suffers from the exponentially large action space caused by a large number of agents. This paper proposes HAVEN, a novel value decomposition framework based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability that arises from concurrently optimizing high-level and low-level policies, and from concurrently optimizing multiple agents, we introduce a dual coordination mechanism of inter-layer strategies and inter-agent strategies. HAVEN does not require domain knowledge or pretraining, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and offers an efficient solution for implicitly partitioning the decision space.
    Doubly Robust Collaborative Targeted Learning for Debiased Recommendations. (arXiv:2203.10258v2 [cs.IR] UPDATED)
In recommender systems, the collected data always contain various biases, which makes accurate prediction challenging. To address selection bias and confounding bias, the doubly robust (DR) method and its variants show superior performance due to their double robustness property and smaller bias under inaccurate propensity and error imputation models. However, we theoretically show that the variance of the error-imputation-based (EIB) method is much smaller than that of DR, although EIB may suffer from a much larger bias. In this paper, we propose a doubly robust targeted learning method that effectively combines the small-bias property of DR and the small-variance property of EIB by leveraging the targeted maximum likelihood estimation technique. Theoretical analysis shows that the proposed targeted learning is effective in reducing the variance of DR while maintaining double robustness. To further reduce the bias and variance during the training process, we propose a novel collaborative targeted learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
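    For reference, the classic DR estimator that this work builds on can be sketched on synthetic data as follows; all quantities below are toy stand-ins for the paper's propensity and imputation models.

        import numpy as np

        def dr_estimate(e_obs, e_hat, o, p_hat):
            # Doubly robust estimate of the average prediction error:
            #   e_obs: observed errors (only valid where o == 1)
            #   e_hat: imputed errors from the error-imputation model
            #   o:     0/1 observation indicator (selection mechanism)
            #   p_hat: estimated propensities P(o == 1)
            return np.mean(e_hat + o * (e_obs - e_hat) / p_hat)

        rng = np.random.default_rng(0)
        n = 10_000
        p = rng.uniform(0.05, 0.9, n)               # true propensities
        o = rng.binomial(1, p)
        e_true = rng.normal(1.0, 0.5, n)            # true per-entry errors
        e_hat = e_true + rng.normal(0, 0.3, n)      # imperfect imputation
        print(dr_estimate(e_true * o, e_hat, o, p)) # close to e_true.mean()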
    Predicting Berth Stay for Tanker Terminals: A Systematic and Dynamic Approach. (arXiv:2204.04085v2 [cs.CE] UPDATED)
With the ongoing digitization of maritime transport and its growing volume, predicting vessel berth stay has become an important requirement for operations research and scheduling optimization in the era of maritime big data, and it plays a significant role in enhancing port efficiency and maritime logistics. This study proposes a systematic and dynamic approach to predicting berth stay for tanker terminals. The approach covers three innovative aspects: 1) The data sources employed are multi-faceted, including cargo operation data from tanker terminals, time-series data from the automatic identification system (AIS), etc. 2) The berth-stay process is innovatively decomposed into multiple blocks based on data analysis and information extraction, and practical operation scenarios are developed accordingly. 3) The predictive models of berth stay are developed on the basis of the prior data analysis and information extraction under two methods: regression and decomposed distribution. The models are evaluated under four dynamic scenarios with certain designated cargoes at two different terminals. The evaluation results show that the proposed approach can predict berth stay with an accuracy of up to 98.81%, validated against historical baselines, and also demonstrate that the approach can predict berth stay dynamically across the scenarios. The model may potentially be applied for short-term pilot-booking or scheduling optimizations within a reasonable time frame to advance port intelligence and logistics efficiency.
    Decentral and Incentivized Federated Learning Frameworks: A Systematic Literature Review. (arXiv:2205.07855v2 [cs.LG] UPDATED)
The advent of Federated Learning (FL) has ignited a new paradigm for parallel and confidential decentralized Machine Learning (ML), with the potential of utilizing the computational power of a vast number of IoT, mobile and edge devices without data leaving the respective device, ensuring privacy by design. Yet, in order to scale this new paradigm beyond small groups of already entrusted entities towards mass adoption, the Federated Learning Framework (FLF) has to become (i) truly decentralized, and (ii) participants have to be incentivized. This is the first systematic literature review analyzing holistic FLFs in the domain of both decentralized and incentivized federated learning. 422 publications were retrieved by querying 12 major scientific databases, and 40 articles remained after a systematic review and filtering process for in-depth examination. Although having massive potential to direct the future of a more distributed and secure AI, none of the analyzed FLFs is production-ready. The approaches vary heavily in terms of use cases, system design, solved issues and thoroughness. We are the first to provide a systematic approach to classify and quantify differences between FLFs, exposing the limitations of current works and deriving future directions for research in this novel domain.
    You Only Cut Once: Boosting Data Augmentation with a Single Cut. (arXiv:2201.12078v2 [cs.CV] UPDATED)
We present You Only Cut Once (YOCO) for performing data augmentations. YOCO cuts one image into two pieces and performs data augmentations individually within each piece. Applying YOCO improves the diversity of the augmentation per sample and encourages neural networks to recognize objects from partial information. YOCO is parameter-free, easy to use, and boosts almost all augmentations for free. Thorough experiments are conducted to evaluate its effectiveness. We first demonstrate that YOCO can be seamlessly applied to varying data augmentations and neural network architectures, and brings performance gains on CIFAR and ImageNet classification tasks, sometimes surpassing conventional image-level augmentation by large margins. Moreover, we show YOCO benefits contrastive pre-training toward a more powerful representation that can be better transferred to multiple downstream tasks. Finally, we study a number of variants of YOCO and empirically analyze the performance for respective settings. Code is available at GitHub.
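    The cut-and-augment idea reduces to a few lines; `augment` below is a hypothetical stand-in for any per-image augmentation.

        import numpy as np

        def yoco(image: np.ndarray, augment) -> np.ndarray:
            # Cut the image into two halves (along a random axis), augment
            # each half independently, then stitch them back together.
            h, w = image.shape[:2]
            if np.random.rand() < 0.5:                       # cut along width
                left, right = image[:, : w // 2], image[:, w // 2 :]
                return np.concatenate([augment(left), augment(right)], axis=1)
            top, bottom = image[: h // 2], image[h // 2 :]   # cut along height
            return np.concatenate([augment(top), augment(bottom)], axis=0)

        img = np.random.rand(32, 32, 3)
        out = yoco(img, augment=lambda x: x[:, ::-1])        # flip each piece
        print(out.shape)                                     # (32, 32, 3)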
Exploring Deep Reinforcement Learning-Assisted Federated Learning for Online Resource Allocation in Privacy-Preserving EdgeIoT. (arXiv:2202.07391v2 [cs.LG] UPDATED)
Federated learning (FL) has been increasingly considered to preserve data training privacy from eavesdropping attacks in mobile edge computing-based Internet of Things (EdgeIoT). On the one hand, the learning accuracy of FL can be improved by selecting the IoT devices with large datasets for training, which gives rise to a higher energy consumption. On the other hand, the energy consumption can be reduced by selecting the IoT devices with small datasets for FL, resulting in a falling learning accuracy. In this paper, we formulate a new resource allocation problem for privacy-preserving EdgeIoT to balance the learning accuracy of FL and the energy consumption of the IoT devices. We propose a new federated learning-enabled twin-delayed deep deterministic policy gradient (FL-DLT3) framework to achieve the optimal accuracy and energy balance in a continuous domain. Furthermore, long short-term memory (LSTM) is leveraged in FL-DLT3 to predict the time-varying network state while FL-DLT3 is trained to select the IoT devices and allocate the transmit power. Numerical results demonstrate that the proposed FL-DLT3 achieves fast convergence (less than 100 iterations) while improving the FL accuracy-to-energy consumption ratio by 51.8% compared to the existing state-of-the-art benchmark.
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v2 [q-bio.NC] UPDATED)
As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest, as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.
    CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. (arXiv:2103.06874v4 [cs.CL] UPDATED)
    Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
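    The downsampling idea can be sketched with a strided convolution over character embeddings; the vocabulary size, dimensions, and plain embedding lookup below are illustrative, not CANINE's actual configuration.

        import torch
        import torch.nn as nn

        # Embed raw characters (codepoints) and shorten the sequence with a
        # strided convolution before a deep transformer stack would run.
        chars = torch.randint(0, 1024, (2, 256))          # batch of codepoints
        embed = nn.Embedding(1024, 64)
        downsample = nn.Conv1d(64, 64, kernel_size=4, stride=4)
        x = embed(chars).transpose(1, 2)                  # (batch, dim, seq)
        print(downsample(x).shape)                        # (2, 64, 64): 4x shorter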
    Describing Differences between Text Distributions with Natural Language. (arXiv:2201.12323v2 [cs.CL] UPDATED)
    How do two distributions of texts differ? Humans are slow at answering this, since discovering patterns might require tediously reading through hundreds of samples. We propose to automatically summarize the differences by "learning a natural language hypothesis": given two distributions $D_{0}$ and $D_{1}$, we search for a description that is more often true for $D_{1}$, e.g., "is military-related." To tackle this problem, we fine-tune GPT-3 to propose descriptions with the prompt: "[samples of $D_{0}$] + [samples of $D_{1}$] + the difference between them is_____." We then re-rank the descriptions by checking how often they hold on a larger set of samples with a learned verifier. On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and re-ranking, and our best system using GPT-3 Davinci (175B) reaches 76%. We apply our system to describe distribution shifts, debug dataset shortcuts, summarize unknown tasks, and label text clusters, and present analyses based on automatically generated descriptions.
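    A compact sketch of this propose-then-rerank pipeline, where `generate` and `verifier_score` are hypothetical stand-ins for the fine-tuned GPT-3 proposer and the learned verifier:

        def propose_descriptions(samples_d0, samples_d1, generate, n=16):
            # Build the proposal prompt from samples of both distributions.
            prompt = (
                "\n".join(samples_d0) + "\n" + "\n".join(samples_d1)
                + "\nthe difference between them is"
            )
            return [generate(prompt) for _ in range(n)]

        def rerank(descriptions, held_out_d0, held_out_d1, verifier_score):
            # Keep descriptions that hold more often on D1 than on D0.
            def gap(desc):
                s1 = sum(verifier_score(desc, x) for x in held_out_d1)
                s0 = sum(verifier_score(desc, x) for x in held_out_d0)
                return s1 / len(held_out_d1) - s0 / len(held_out_d0)
            return sorted(descriptions, key=gap, reverse=True)

        gen = lambda prompt: "is about the military"         # stand-in proposer
        ver = lambda desc, x: float(desc.split()[-1] in x)   # stand-in verifier
        descs = propose_descriptions(["cats", "dogs"], ["army news", "navy ops"], gen, n=4)
        print(rerank(descs, ["cats"], ["military drill"], ver)[0])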
    PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (arXiv:2202.06658v5 [cs.LG] UPDATED)
Ensemble methods have been widely used to improve the generalization performance of machine learning methods, but they are hard to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) incurs a much higher computational overhead for model training. Recently, advanced techniques such as fast geometric ensembling (FGE) and snapshot ensembles have been proposed. These methods can train model ensembles in the same amount of time as a single model, thus getting around the hurdle of training time. However, their memory overhead for test-time inference remains much higher than that of single-model-based methods. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs generated by successively performed stochastic weight averaging procedures. Experimental results across different advanced DNN architectures on the benchmark datasets CIFAR-$\{10,100\}$ and ImageNet demonstrate that PFGE matches the state-of-the-art FGE method in terms of generalization error, yet requires only 20% of the memory overhead for test-time inference. Our code is available at https://github.com/ZJLAB-AMMI/PFGE.
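    The stochastic weight averaging building block is easy to illustrate; a minimal sketch with toy state dicts standing in for periodically collected model snapshots:

        import copy
        import torch

        def swa_average(state_dicts):
            # One SWA pass: average the weights of several model snapshots.
            avg = copy.deepcopy(state_dicts[0])
            for name in avg:
                avg[name] = torch.stack([sd[name] for sd in state_dicts]).mean(0)
            return avg

        # PFGE-style sketch: successive SWA passes each yield one averaged
        # model; the final lightweight ensemble combines their predictions.
        snapshots = [{"w": torch.randn(3)} for _ in range(4)]  # toy checkpoints
        ensemble = [swa_average(snapshots[:2]), swa_average(snapshots[2:])]
        print([sd["w"] for sd in ensemble])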
    Finite-Sum Coupled Compositional Stochastic Optimization: Theory and Applications. (arXiv:2202.12396v3 [math.OC] UPDATED)
This paper studies stochastic optimization for a sum of compositional functions, where the inner-level function of each summand is coupled with the corresponding summation index. We refer to this family of problems as finite-sum coupled compositional optimization (FCCO). It has broad applications in machine learning for optimizing non-convex or convex compositional measures/objectives such as average precision (AP), p-norm push, listwise ranking losses, neighborhood component analysis (NCA), deep survival analysis, and deep latent variable models, all of which deserve finer analysis. Yet, existing algorithms and analyses are restricted in one aspect or another. The contribution of this paper is to provide a comprehensive analysis of a simple stochastic algorithm for both non-convex and convex objectives. The key results are improved oracle complexities with parallel speed-up, obtained via a moving-average-based stochastic estimator with mini-batching. Our theoretical analysis also exhibits new insights for improving the practical implementation by sampling batches of equal size for the outer and inner levels. Numerical experiments on AP maximization, NCA, and p-norm push optimization corroborate some aspects of the theory.
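    To see the moving-average estimator in action, consider a toy instance of $F(w) = \frac{1}{n}\sum_i f(g_i(w))$ with scalar inner functions; the functions and constants below are illustrative only.

        import numpy as np

        rng = np.random.default_rng(0)
        n, gamma, lr = 100, 0.1, 0.01
        a = rng.normal(size=n)                      # toy data: g_i(w) = a_i * w
        df = lambda u: 2 * u                        # derivative of f(u) = u^2
        w = 1.0
        u = np.zeros(n)                             # moving-average inner estimates

        for step in range(500):
            batch = rng.choice(n, size=10, replace=False)
            g_hat = a[batch] * w                    # stochastic inner estimate
            u[batch] = (1 - gamma) * u[batch] + gamma * g_hat
            grad = np.mean(df(u[batch]) * a[batch]) # chain rule with tracked u_i
            w -= lr * grad
        print(w)                                    # drifts toward the minimizer 0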
    A simple yet effective baseline for non-attributed graph classification. (arXiv:1811.03508v3 [cs.LG] UPDATED)
    Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective. As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves similar performance as the state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with the graph neural networks. Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited amount of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
    A Scalable AutoML Approach Based on Graph Neural Networks. (arXiv:2111.00083v3 [cs.LG] UPDATED)
    AutoML systems build machine learning models automatically by performing a search over valid data transformations and learners, along with hyper-parameter optimization for each learner. Many AutoML systems use meta-learning to guide search for optimal pipelines. In this work, we present a novel meta-learning system called KGpip which, (1) builds a database of datasets and corresponding pipelines by mining thousands of scripts with program analysis, (2) uses dataset embeddings to find similar datasets in the database based on its content instead of metadata-based features, (3) models AutoML pipeline creation as a graph generation problem, to succinctly characterize the diverse pipelines seen for a single dataset. KGpip's meta-learning is a sub-component for AutoML systems. We demonstrate this by integrating KGpip with two AutoML systems. Our comprehensive evaluation using 126 datasets, including those used by the state-of-the-art systems, shows that KGpip significantly outperforms these systems.
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v2 [stat.ML] UPDATED)
Traditional ways of handling missing values are not designed for the clustering purpose, and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available at \url{https://github.com/AudeSportisse/Clustering-MNAR}. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations for the proposed sub-models on synthetic data and illustrate the relevance of our method on a medical register, the TraumaBase® dataset.
    Shape complexity in cluster analysis. (arXiv:2205.08046v2 [cs.LG] UPDATED)
In cluster analysis, a common first step is to scale the data, aiming to better partition them into clusters. Even though many different techniques have been introduced over the years to this end, it is probably fair to say that the workhorse in this preprocessing phase has been division of the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
    Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. (arXiv:2107.08661v5 [cs.CL] UPDATED)
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them together. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches that of cascade systems. In addition, we propose a simple method for preserving speakers' voices from the source speech in the translated speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice on speaker turns without requiring speaker segmentation. Furthermore, compared to existing approaches, it better preserves speakers' privacy and mitigates potential misuse of voice cloning for creating spoofing audio artifacts.
    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. (arXiv:2112.06868v2 [cs.LG] UPDATED)
Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with zero variance that is correctly supported on the ground-truth manifold. They gave partial support for this conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders the conjecture is true: VAE training does recover a generator with support equal to the ground-truth manifold, and it does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground-truth manifold.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v3 [stat.ME] UPDATED)
Recent advances in probabilistic deep learning enable amortized Bayesian inference in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations represent reality somewhat inaccurately? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of SNPE-C (APT) and the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the learned latent data summary space, and utilize maximum mean discrepancy (MMD) to detect, during inference, potentially catastrophic misspecifications that undermine the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase when the distance between the latent summary distributions of the true data-generating process and the training simulations grows. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized simulation-based Bayesian inference.
    A label efficient two-sample test. (arXiv:2111.08861v3 [cs.LG] UPDATED)
    Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
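    The bimodal query scheme itself is compact; a minimal sketch, with random numbers standing in for the trained classifier's posteriors (the FR test on the queried samples is not shown):

        import numpy as np

        def bimodal_query(posteriors, budget):
            # Query labels for samples whose predicted class-1 posterior is
            # most extreme on either side (closest to 0 and closest to 1);
            # half the budget goes to each mode.
            order = np.argsort(posteriors)
            k = budget // 2
            return np.concatenate([order[:k], order[-k:]])

        post = np.random.rand(1000)          # stand-in for classifier posteriors
        queried = bimodal_query(post, budget=50)
        print(queried[:5])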
    Leveraging Global Binary Masks for Structure Segmentation in Medical Images. (arXiv:2205.09107v1 [eess.IV])
Deep learning (DL) models for medical image segmentation are highly influenced by intensity variations of input images and lack generalization because they primarily utilize pixels' intensity information for inference. Acquiring sufficient training data is another challenge limiting models' applications. We propose to leverage the consistency of organs' anatomical shape and position information in medical images. We introduce a framework that leverages recurring anatomical patterns through global binary masks for organ segmentation. Two scenarios were studied: 1) Global binary masks were the model's (i.e., U-Net's) only input, forcing it to exclusively encode organs' position and shape information for segmentation/localization. 2) Global binary masks were incorporated as an additional channel, functioning as position/shape clues to mitigate training data scarcity. Two datasets of brain and heart CT images with their ground truth were split into (26:10:10) and (12:3:5) for training, validation, and testing, respectively. Training exclusively on global binary masks led to Dice scores of 0.77(0.06) and 0.85(0.04), with average Euclidean distances of 3.12(1.43)mm and 2.5(0.93)mm relative to the center of mass of the ground truth for the brain and heart structures, respectively. The outcomes indicate that a surprising degree of position and shape information is encoded through global binary masks. Incorporating global binary masks led to significantly higher accuracy relative to the model trained on CT images alone in small subsets of training data; the performance improved by 4.3-125.3% and 1.3-48.1% for 1-8 training cases of the brain and heart datasets, respectively. The findings imply the advantages of utilizing global binary masks for building generalizable models and compensating for training data scarcity.
    Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks. (arXiv:2004.12551v2 [cs.LG] UPDATED)
    Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.
    Detecting micro fractures: A comprehensive comparison of conventional and machine-learning based segmentation methods. (arXiv:2103.12821v2 [cs.LG] UPDATED)
    Studying porous rock materials with X-Ray Computed Tomography (XRCT) has been established as a standard procedure for the non-destructive visualization of flow and transport in opaque porous media. Despite the recent advances in the field of XRCT, some challenges still remain due to the inherent noise and imaging artefacts in the produced data. These issues become even more profound when the objective is the identification of fractures, and/or fracture networks. The challenge is the limited contrast between the regions of interest and the neighboring areas. This limited contrast can mostly be attributed to the minute aperture of the fractures. In order to overcome this challenge, it has been a common approach to apply digital image processing, such as filtering, to enhance the signal-to-noise ratio. Additionally, segmentation methods based on threshold-/morphology schemes can be employed to obtain enhanced information from the features of interest. However, this workflow needs a skillful operator to fine-tune its input parameters, and the required computation time significantly increases due to the complexity of the available methods, and the large volume of the data-set. In this study, based on a data-set produced by the successful visualization of a fracture network in Carrara marble with XRCT, we present the segmentation results from a number of segmentation methods. Three conventional and two machine-learning-based methods are evaluated. The segmentation results from all five methods are compared to each other in terms of segmentation quality and time efficiency. Due to memory limitations, and in order to accomplish a fair comparison, all the methods are employed in a 2D scheme. The output of the 2D U-net model, which is one of the adopted machine-learning-based segmentation methods, shows the best performance regarding the quality of segmentation and the required processing time.
    Finite-Bit Quantization For Distributed Algorithms With Linear Convergence. (arXiv:2107.11304v3 [math.OC] UPDATED)
    This paper studies distributed algorithms for (strongly convex) composite optimization problems over mesh networks, subject to quantized communications. Instead of focusing on a specific algorithmic design, a black-box model is proposed, casting linearly convergent distributed algorithms in the form of fixed-point iterates. The algorithmic model is equipped with a novel random or deterministic Biased Compression (BC) rule on the quantizer design, and a new Adaptive encoding Nonuniform Quantizer (ANQ) coupled with a communication-efficient encoding scheme, which implements the BC-rule using a finite number of bits (below machine precision). This fills a gap existing in most state-of-the-art quantization schemes, such as those based on the popular compression rule, which rely on communication of some scalar signals with negligible quantization error (in practice quantized at the machine precision). A unified communication complexity analysis is developed for the black-box model, determining the average number of bits required to reach a solution of the optimization problem within a target accuracy. It is shown that the proposed BC-rule preserves linear convergence of the unquantized algorithms, and a trade-off between convergence rate and communication cost under ANQ-based quantization is characterized. Numerical results validate our theoretical findings and show that distributed algorithms equipped with the proposed ANQ have more favorable communication cost than algorithms using state-of-the-art quantization rules.
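    As a point of reference, a plain uniform finite-bit quantizer looks as follows; the paper's ANQ is adaptive and nonuniform, so this is only a simplified stand-in for the general idea of communicating with a finite number of bits.

        import numpy as np

        def quantize(x, lo, hi, bits=8):
            # Uniform finite-bit quantizer over [lo, hi]: clip, map to one
            # of 2**bits levels, and map back to the original range.
            levels = 2 ** bits - 1
            q = np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * levels)
            return lo + q * (hi - lo) / levels

        x = np.random.randn(5)
        print(x)
        print(quantize(x, -3, 3, bits=4))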
    Optimizing Operating Points for High Performance Lesion Detection and Segmentation Using Lesion Size Reweighting. (arXiv:2107.12978v2 [eess.IV] UPDATED)
There are many clinical contexts which require accurate detection and segmentation of all focal pathologies (e.g. lesions, tumours) in patient images. In cases where there is a mix of small and large lesions, standard binary cross-entropy loss will result in better segmentation of large lesions at the expense of missing small ones. Adjusting the operating point to accurately detect all lesions generally leads to oversegmentation of large lesions. In this work, we propose a novel reweighting strategy to eliminate this performance gap, increasing small pathology detection performance while maintaining segmentation accuracy. We show that our reweighting strategy vastly outperforms competing strategies based on experiments on a large-scale, multi-scanner, multi-center dataset of Multiple Sclerosis patient images.
    Testing the Robustness of a BiLSTM-based Structural Story Classifier. (arXiv:2201.02733v2 [cs.CL] UPDATED)
The growing prevalence of counterfeit stories on the internet has fostered significant interest in fast and scalable detection of fake news in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe that there is a need to evaluate the impact of noise on these techniques' performance, where noise constitutes news articles being mistakenly labeled as fake (or real). This work takes a step in that direction, where we examine the impact of noise on a state-of-the-art, structural model for fake news detection based on a BiLSTM (Bidirectional Long Short-Term Memory) network: Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (Reference no. 9).
    Linear Speedup in Personalized Collaborative Learning. (arXiv:2111.05968v3 [cs.LG] UPDATED)
    Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as a stochastic optimization of a task $0$ while given access to $N$ related but different tasks $1,\dots, N$. We give convergence guarantees for two algorithms in this setting -- a popular collaboration method known as \emph{weighted gradient averaging}, and a novel \emph{bias correction} method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks $N$. Further, we also empirically study their performance confirming our theoretical insights.
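    The weighted gradient averaging baseline can be sketched directly; the weights below are illustrative.

        import numpy as np

        def weighted_gradient_average(grads, alphas):
            # Combine the target task's gradient with auxiliary-task
            # gradients: grads[0] belongs to task 0, alphas are nonnegative
            # collaboration weights (normalized to sum to 1).
            alphas = np.asarray(alphas) / np.sum(alphas)
            return sum(a * g for a, g in zip(alphas, grads))

        g = [np.random.randn(5) for _ in range(4)]      # task 0 plus 3 helpers
        step = weighted_gradient_average(g, alphas=[0.7, 0.1, 0.1, 0.1])
        print(step)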
    Learning Selective Sensor Fusion for States Estimation. (arXiv:1912.13077v2 [cs.CV] UPDATED)
Autonomous vehicles and mobile robotic systems are typically equipped with multiple sensors to provide redundancy. By integrating the observations from different sensors, these mobile agents are able to perceive the environment and estimate system states, e.g. locations and orientations. Although deep learning approaches for multimodal odometry estimation and localization have gained traction, they rarely focus on the issue of robust sensor fusion - a necessary consideration to deal with noisy or incomplete sensor observations in the real world. Moreover, current deep odometry models suffer from a lack of interpretability. To this end, we propose SelectFusion, an end-to-end selective sensor fusion module which can be applied to useful pairs of sensor modalities, such as monocular images and inertial measurements, or depth images and LIDAR point clouds. Our model is a uniform framework that is not restricted to a specific modality or task. During prediction, the network is able to assess the reliability of the latent features from different sensor modalities and estimate both trajectory at scale and global pose. In particular, we propose two fusion modules - a deterministic soft fusion and a stochastic hard fusion - and offer a comprehensive study of the new strategies compared to trivial direct fusion. We extensively evaluate all fusion strategies both on public datasets and on progressively degraded datasets that present synthetic occlusions, noisy and missing data, and time misalignment between sensors, and we investigate the effectiveness of the different fusion strategies in attending to the most reliable features, which in itself provides insights into the operation of the various models.
    Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. (arXiv:2005.09147v8 [cs.CV] UPDATED)
Deep neural networks (DNNs) are vulnerable to adversarial noises. By adding adversarial noises to training samples, adversarial training can improve a model's robustness against adversarial noises. However, adversarial training samples with excessive noise can harm standard accuracy, which may be unacceptable for many medical image analysis applications. This issue has been termed the trade-off between standard accuracy and adversarial robustness. In this paper, we hypothesize that this issue may be alleviated if the adversarial samples for training are placed right on the decision boundaries. Based on this hypothesis, we design an adaptive adversarial training method, named IMA. For each individual training sample, IMA makes a sample-wise estimation of the upper bound of the adversarial perturbation. In the training process, each of the sample-wise adversarial perturbations is gradually increased to match the margin. Once an equilibrium state is reached, the adversarial perturbations stop increasing. IMA is evaluated on publicly available datasets under two popular adversarial attacks, PGD and IFGSM. The results show that: (1) IMA significantly improves the adversarial robustness of DNN classifiers, achieving state-of-the-art performance; (2) IMA causes the smallest reduction in clean accuracy among all competing defense methods; (3) IMA can be applied to pretrained models to reduce time cost; (4) IMA can be applied to state-of-the-art medical image segmentation networks, with outstanding performance. We hope our work may help to lift the trade-off between adversarial robustness and clean accuracy and facilitate the development of robust applications in the medical field. The source code will be released when this paper is published.
    Physics-informed Guided Disentanglement in Generative Networks. (arXiv:2107.14229v2 [cs.CV] UPDATED)
Image-to-image translation (i2i) networks suffer from entanglement effects in the presence of physics-related phenomena in the target domain (such as occlusions, fog, etc.), which lowers the translation quality, controllability and variability. In this paper, we build upon a collection of simple physics models and present a comprehensive method for disentangling visual traits in target images, guiding the process with a physical model that renders some of the target traits and learning the remaining ones. Because it allows explicit and interpretable outputs, our physical model (optimally regressed on the target) allows generating unseen scenarios in a controllable manner. We also extend our framework, showing versatility to neural-guided disentanglement. The results show our disentanglement strategies dramatically increase performance qualitatively and quantitatively in several challenging scenarios for image translation.
    Link Scheduling using Graph Neural Networks. (arXiv:2109.05536v2 [eess.SP] UPDATED)
    Efficient scheduling of transmissions is a key problem in wireless networks. The main challenge stems from the fact that optimal link scheduling involves solving a maximum weighted independent set (MWIS) problem, which is known to be NP-hard. In practical schedulers, centralized and distributed greedy heuristics are commonly used to approximately solve the MWIS problem. However, these greedy heuristics mostly ignore important topological information of the wireless network. To overcome this limitation, we propose fast heuristics based on graph convolutional networks (GCNs) that can be implemented in centralized and distributed manners. Our centralized heuristic is based on tree search guided by a GCN and 1-step rollout. In our distributed MWIS solver, a GCN generates topology-aware node embeddings that are combined with per-link utilities before invoking a distributed greedy solver. Moreover, a novel reinforcement learning scheme is developed to train the GCN in a non-differentiable pipeline. Test results on medium-sized wireless networks show that our centralized heuristic can reach a near-optimal solution quickly, and our distributed heuristic based on a shallow GCN can reduce by nearly half the suboptimality gap of the distributed greedy solver with minimal increase in complexity. The proposed schedulers also exhibit good generalizability across graph and weight distributions.
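    The greedy MWIS heuristic at the core of the distributed scheduler is easy to sketch; in the paper the per-link utilities are first combined with topology-aware GNN node embeddings, which is omitted here.

        import networkx as nx
        import numpy as np

        def greedy_mwis(conflict_graph: nx.Graph, utility: dict) -> set:
            # Greedy maximum-weighted-independent-set heuristic: repeatedly
            # schedule the highest-utility remaining link, then remove it
            # and all conflicting (adjacent) links.
            g = conflict_graph.copy()
            schedule = set()
            while g.number_of_nodes() > 0:
                v = max(g.nodes, key=utility.get)
                schedule.add(v)
                g.remove_nodes_from([v] + list(g.neighbors(v)))
            return schedule

        graph = nx.erdos_renyi_graph(20, 0.2, seed=1)
        util = {v: float(np.random.rand()) for v in graph.nodes}
        print(greedy_mwis(graph, util))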
    PocketNet: A Smaller Neural Network for Medical Image Analysis. (arXiv:2104.10745v3 [eess.IV] UPDATED)
    Medical imaging deep learning models are often large and complex, requiring specialized hardware to train and evaluate these models. To address such issues, we propose the PocketNet paradigm to reduce the size of deep learning models by throttling the growth of the number of channels in convolutional neural networks. We demonstrate that, for a range of segmentation and classification tasks, PocketNet architectures produce results comparable to that of conventional neural networks while reducing the number of parameters by multiple orders of magnitude, using up to 90% less GPU memory, and speeding up training times by up to 40%, thereby allowing such models to be trained and deployed in resource-constrained settings.
    Cohort Bias Adaptation in Aggregated Datasets for Lesion Segmentation. (arXiv:2108.00713v2 [eess.IV] UPDATED)
Many automatic machine learning models developed for focal pathology (e.g. lesions, tumours) detection and segmentation perform well, but do not generalize as well to new patient cohorts, impeding their widespread adoption into real clinical contexts. One strategy to create a more diverse, generalizable training set is to naively pool datasets from different cohorts. Surprisingly, training on this big data does not necessarily increase, and may even reduce, overall performance and model generalizability, due to the existence of cohort biases that affect label distributions. In this paper, we propose a generalized affine conditioning framework to learn and account for cohort biases across multi-source datasets, which we call Source-Conditioned Instance Normalization (SCIN). Through extensive experimentation on three different, large-scale, multi-scanner, multi-centre Multiple Sclerosis (MS) clinical trial MRI datasets, we show that our cohort bias adaptation method (1) improves performance of the network on pooled datasets relative to naively pooling datasets and (2) can quickly adapt to a new cohort by fine-tuning the instance normalization parameters, thus learning the new cohort bias with only 10 labelled samples.
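    A minimal sketch of what source-conditioned instance normalization might look like as a PyTorch module, with a separate affine pair per cohort; the details here are assumptions, not the paper's exact implementation.

        import torch
        import torch.nn as nn

        class SCIN(nn.Module):
            # Shared instance-norm statistics, with a cohort-specific
            # scale/shift pair; fine-tuning only these parameters would be
            # how one adapts to a new cohort from few labelled samples.
            def __init__(self, num_features: int, num_cohorts: int):
                super().__init__()
                self.norm = nn.InstanceNorm2d(num_features, affine=False)
                self.gamma = nn.Embedding(num_cohorts, num_features)
                self.beta = nn.Embedding(num_cohorts, num_features)
                nn.init.ones_(self.gamma.weight)
                nn.init.zeros_(self.beta.weight)

            def forward(self, x: torch.Tensor, cohort: torch.Tensor):
                g = self.gamma(cohort)[:, :, None, None]
                b = self.beta(cohort)[:, :, None, None]
                return self.norm(x) * g + b

        layer = SCIN(num_features=8, num_cohorts=3)
        x = torch.randn(4, 8, 32, 32)
        print(layer(x, cohort=torch.tensor([0, 1, 2, 0])).shape)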
    Assisted Learning for Organizations with Limited Data. (arXiv:2109.09307v3 [cs.LG] UPDATED)
We develop an assisted learning framework to help organization-level learners improve their learning performance with limited and imbalanced data. In particular, learners at the organization level usually have sufficient computation resources, but are subject to stringent collaboration policies and information privacy requirements. Their limited, imbalanced data often cause biased inference and sub-optimal decision-making. In our assisted learning framework, an organizational learner purchases assistance service from a service provider and aims to enhance its model performance within a few assistance rounds. We develop effective stochastic training algorithms for assisted deep learning and assisted reinforcement learning. Different from existing distributed algorithms that need to frequently transmit gradients or models, our framework allows the learner to only occasionally share information with the service provider, and still achieve a near-oracle model as if all the data were centralized.
    Masked Autoencoders As Spatiotemporal Learners. (arXiv:2205.09113v1 [cs.CV])
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to the information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
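    The spacetime-agnostic random masking step can be sketched as follows; patch counts and dimensions are illustrative.

        import torch

        def random_spacetime_mask(patches: torch.Tensor, mask_ratio: float = 0.9):
            # Randomly keep a small visible subset of tokenized spacetime
            # patches (batch, num_patches, dim); the encoder sees only the
            # visible tokens, and `keep` records their positions.
            b, n, _ = patches.shape
            n_keep = int(n * (1 - mask_ratio))
            noise = torch.rand(b, n)
            keep = noise.argsort(dim=1)[:, :n_keep]          # random subset
            visible = torch.gather(
                patches, 1, keep[:, :, None].expand(-1, -1, patches.size(-1))
            )
            return visible, keep

        tokens = torch.randn(2, 1568, 768)    # e.g. 8x14x14 spacetime patches
        vis, keep = random_spacetime_mask(tokens)
        print(vis.shape)                      # (2, 156, 768): ~10% of tokens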
    Efficient PAC Reinforcement Learning in Regular Decision Processes. (arXiv:2105.06784v3 [cs.AI] UPDATED)
    Recently regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and it reasonably captures the difficulty of a regular decision process.
    Phy-Q: A Testbed for Physical Reasoning. (arXiv:2108.13696v2 [cs.AI] UPDATED)
    Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. For each scenario, we create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human level Phy-Q score. Website: https://github.com/phy-q/benchmark
    GSN: A Graph Neural Network Inspired by Spring Network. (arXiv:2201.12994v3 [cs.LG] UPDATED)
The design of Graph Neural Networks (GNNs) that operate on both homophilous and heterophilous graphs has received research attention in recent years. Existing heterophilous GNNs, particularly those designed in the spatial domain, lack a convincing theoretical or physical motivation. Inspired by the old-fashioned spring network model, we propose the Graph Spring Network (GSN), a universal GNN model that works for homophilous and heterophilous graphs. We show that the GSN framework can interpret many GNN models from the perspective of the potential energy minimization of a spring network with respect to various metrics, which endows these models with strong physical motivations. We also conduct experiments to demonstrate the performance of our GSN model on real-world datasets.
    Single-Shot Optical Neural Network. (arXiv:2205.09103v1 [cs.ET])
As deep neural networks (DNNs) grow to solve increasingly complex problems, they are becoming limited by the latency and power consumption of existing digital processors. 'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by DNNs by eliminating expensive weight updates; however, its scalability has been limited to an input vector length $K$ of hundreds of elements. Here, we present a scalable, single-shot-per-layer weight-stationary optical processor that leverages the advantages of free-space optics for passive optical copying and large-scale distribution of an input vector, and integrated optoelectronics for static, reconfigurable weighting and the nonlinearity. We propose an optimized near-term CMOS-compatible system with $K = 1,000$ and beyond, and we calculate its theoretical total latency ($\sim$10 ns), energy consumption ($\sim$10 fJ/MAC) and throughput ($\sim$petaMAC/s) per layer. We also experimentally test DNN classification accuracy with single-shot analog optical encoding, copying and weighting of the MNIST handwritten digit dataset in a proof-of-concept system, achieving 94.7% (similar to the ground-truth accuracy of 96.3%) without retraining on the hardware or data preprocessing. Lastly, we determine the upper bound on the throughput of our system ($\sim$0.9 exaMAC/s), set by the maximum optical bandwidth before significant loss of accuracy. This joint use of wide spectral and spatial bandwidths enables highly efficient computing for next-generation DNNs.
    Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement. (arXiv:1810.09103v3 [cs.LG] UPDATED)
Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy regularization.
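    A minimal sketch of the CCEM actor update on a 1-D toy problem; `actor`, `proposal` and `q_fn` are stand-ins, not the paper's implementation.

        import torch
        from torch.distributions import Normal

        def ccem_actor_loss(actor, proposal, q_fn, state, n=64, rho=0.2):
            # Sample actions from the slower-moving proposal policy, keep
            # the top rho-percentile by action-value, and increase the
            # actor's log-likelihood on those elite actions.
            with torch.no_grad():
                actions = proposal(state).sample((n,))        # (n, action_dim)
                q = q_fn(state, actions)                      # (n,)
                k = max(1, int(rho * n))
                elite = actions[q.topk(k).indices]            # top percentile
            return -actor(state).log_prob(elite).mean()       # maximum likelihood

        mu = torch.zeros(1, requires_grad=True)               # toy actor mean
        actor = lambda s: Normal(mu, 1.0)
        proposal = lambda s: Normal(torch.zeros(1), 2.0)      # broader policy
        q_fn = lambda s, a: -(a.squeeze(-1) - 3.0) ** 2       # best action near 3
        loss = ccem_actor_loss(actor, proposal, q_fn, state=None)
        loss.backward()
        print(mu.grad)                                        # pulls mu toward elites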
    On the Efficiency of Entropic Regularized Algorithms for Optimal Transport. (arXiv:1906.01437v9 [cs.DS] UPDATED)
    We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result matches the best known complexity bound of Sinkhorn and helps clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates, as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn by appealing to estimate sequences and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.
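    For concreteness, here is a minimal Python sketch of the greedy row/column updates that distinguish Greenkhorn from Sinkhorn. It picks the coordinate with the largest absolute marginal violation, whereas the analysis above uses a Bregman-type distance, so treat it as illustrative only:

        import numpy as np

        def greenkhorn(C, r, c, eps, n_iter=5000, tol=1e-8):
            """Greedy Sinkhorn (Greenkhorn) sketch for entropic OT."""
            K = np.exp(-C / eps)
            u, v = np.ones(len(r)), np.ones(len(c))
            for _ in range(n_iter):
                B = u[:, None] * K * v[None, :]
                dr, dc = np.abs(B.sum(1) - r), np.abs(B.sum(0) - c)
                i, j = dr.argmax(), dc.argmax()
                if max(dr[i], dc[j]) < tol:
                    break
                if dr[i] >= dc[j]:
                    u[i] = r[i] / (K[i] @ v)      # rescale the worst row...
                else:
                    v[j] = c[j] / (u @ K[:, j])   # ...or the worst column
            return u[:, None] * K * v[None, :]

        rng = np.random.default_rng(1)
        C = rng.random((5, 5))
        P = greenkhorn(C, np.full(5, 0.2), np.full(5, 0.2), eps=0.1)
        print(P.sum(), P.sum(1))  # total mass ~1, row marginals ~uniform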
    Representation Learning for Content-Sensitive Anomaly Detection in Industrial Networks. (arXiv:2205.08953v1 [cs.LG])
    Using a convGRU-based autoencoder, this thesis proposes a framework to learn spatial-temporal aspects of raw network traffic in an unsupervised and protocol-agnostic manner. The learned representations are used to measure their effect on the results of a subsequent anomaly detection and are compared to the application without the extracted features. The evaluation showed that the anomaly detection could not be effectively enhanced when applied to compressed traffic fragments in the context of network intrusion detection. Yet, the trained autoencoder successfully generates a compressed representation (code) of the network traffic that holds spatial and temporal information. Based on the model's residual loss, the autoencoder is also capable of detecting anomalies by itself. Lastly, an approach to model interpretability (LRP) was investigated in order to identify relevant areas within the raw input data, which is used to enrich the alerts generated by the anomaly detection method.
    Meta-Learning Sparse Compression Networks. (arXiv:2205.08957v1 [stat.ML])
    Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks, this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing sparsification to be employed in the inner loop of commonly used Meta-Learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
    A weakly supervised framework for high-resolution crop yield forecasts. (arXiv:2205.09016v1 [cs.LG])
    Predictor inputs and label data for crop yield forecasting are not always available at the same spatial resolution. We propose a deep learning framework that uses high resolution inputs and low resolution labels to produce crop yield forecasts for both spatial levels. The forecasting model is calibrated by weak supervision from low resolution crop area and yield statistics. We evaluated the framework by disaggregating regional yields in Europe from parent statistical regions to sub-regions for five countries (Germany, Spain, France, Hungary, Italy) and two crops (soft wheat and potatoes). Performance of weakly supervised models was compared with linear trend models and Gradient-Boosted Decision Trees (GBDT). Higher resolution crop yield forecasts are useful to policymakers and other stakeholders. Weakly supervised deep learning methods provide a way to produce such forecasts even in the absence of high resolution yield data.
    Medical Deep Learning -- A systematic Meta-Review. (arXiv:2010.14881v5 [eess.IV] UPDATED)
    Deep learning (DL) has remarkably impacted several different scientific disciplines over the last few years. E.g., in image processing and analysis, DL algorithms were able to outperform other cutting-edge methods. Additionally, DL has delivered state-of-the-art results in tasks like autonomous driving, outclassing previous attempts. There are even instances where DL outperformed humans, for example with object recognition and gaming. DL is also showing vast potential in the medical domain. With the collection of large quantities of patient records and data, and a trend towards personalized treatments, there is a great need for automated and reliable processing and analysis of health information. Patient data is not only collected in clinical centers, like hospitals and private practices, but also by mobile healthcare apps or online websites. The abundance of collected patient data and the recent growth in the DL field have resulted in a large increase in research efforts. In Q2/2020, the search engine PubMed already returned over 11,000 results for the search term 'deep learning', and around 90% of these publications are from the last three years. However, even though PubMed represents the largest search engine in the medical field, it does not cover all medical-related publications. Hence, a complete overview of the field of 'medical deep learning' is almost impossible to obtain, and acquiring a full overview of medical sub-fields is becoming increasingly difficult. Nevertheless, several review and survey articles about medical DL have been published within the last few years. They focus, in general, on specific medical scenarios, like the analysis of medical images containing specific pathologies. With these surveys as a foundation, the aim of this article is to provide the first high-level, systematic meta-review of medical DL surveys.
    The Kernelized Taylor Diagram. (arXiv:2205.08864v1 [stat.ML])
    This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. It builds on the widely used Taylor diagram, which, however, has several limitations, such as not capturing non-linear relationships and being sensitive to outliers. To address these limitations, the proposed kernelized Taylor diagram visualizes similarities between populations with minimal assumptions on the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.
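    The diagram itself is a polar plot built from kernel mean embeddings; as a minimal sketch of the central quantity it relates, the snippet below computes a (biased) squared-MMD estimate with an RBF kernel (the kernel choice and bandwidth are our assumptions):

        import numpy as np

        def mmd2(X, Y, gamma=1.0):
            """Biased estimate of squared MMD with an RBF kernel."""
            def k(A, B):
                d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-gamma * d)
            return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

        rng = np.random.default_rng(0)
        X = rng.normal(0.0, 1.0, (200, 2))
        Y = rng.normal(0.5, 1.0, (200, 2))
        print(mmd2(X, Y))  # grows as the two populations differ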
    FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting. (arXiv:2205.08897v1 [cs.LG])
    Recent studies have shown the promising performance of deep learning models (e.g., RNN and Transformer) for long-term time series forecasting. These studies mostly focus on designing deep models to effectively combine historical information for long-term forecasting. However, the question of how to effectively represent historical information for long-term forecasting has not received enough attention, limiting our capacity to exploit powerful deep learning models. The main challenge in time series representation is how to handle the dilemma between accurately preserving historical information and reducing the impact of noisy signals in the past. To this end, we design a \textbf{F}requency \textbf{i}mproved \textbf{L}egendre \textbf{M}emory model, or \textbf{FiLM} for short: it introduces Legendre polynomial projections to preserve historical information accurately and Fourier projections plus low-rank approximation to remove noisy signals. Our empirical studies show that the proposed FiLM improves the accuracy of state-of-the-art models by a significant margin (\textbf{19.2\%} and \textbf{22.6\%}) in multivariate and univariate long-term forecasting, respectively. In addition, the dimensionality reduction introduced by the low-rank approximation leads to a dramatic improvement in computational efficiency. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the performance of most deep learning modules for long-term forecasting. Code will be released soon.
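    FiLM's memory module is recurrent and couples Legendre projections with Fourier filtering and low-rank approximation; as a static sketch of just the Legendre-projection idea, the snippet below compresses a noisy history window into a handful of coefficients and reconstructs a smooth summary (the degree and window length are arbitrary choices of ours):

        import numpy as np
        from numpy.polynomial import legendre

        def legendre_coeffs(x, degree=8):
            """Least-squares projection of a window onto Legendre polynomials."""
            t = np.linspace(-1.0, 1.0, len(x))
            return legendre.legfit(t, x, degree)

        def reconstruct(coeffs, length):
            t = np.linspace(-1.0, 1.0, length)
            return legendre.legval(t, coeffs)

        rng = np.random.default_rng(0)
        x = np.sin(np.linspace(0.0, 6.0, 128)) + 0.1 * rng.normal(size=128)
        x_hat = reconstruct(legendre_coeffs(x), 128)  # denoised history summary
        print(np.abs(x - x_hat).mean())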
    Fast Neural Network based Solving of Partial Differential Equations. (arXiv:2205.08978v1 [cs.LG])
    We present a novel method for using Neural Networks (NNs) to find solutions to a class of Partial Differential Equations (PDEs). Our method builds on recent advances in Neural Radiance Field (NeRF) research and allows an NN to converge to a PDE solution much faster than classic Physics-Informed Neural Network (PINN) approaches.
    Learning Shared Kernel Models: the Shared Kernel EM algorithm. (arXiv:2205.09041v1 [cs.LG])
    Expectation maximisation (EM) is an unsupervised learning method for estimating the parameters of a finite mixture distribution. It works by introducing "hidden" or "latent" variables via Baum's auxiliary function $Q$ that allow the joint data likelihood to be expressed as a product of simple factors. The relevance of EM has increased since the introduction of the variational lower bound (VLB): the VLB differs from Baum's auxiliary function only by the entropy of the PDF of the latent variables $Z$. We first present a rederivation of the standard EM algorithm using data association ideas from the field of multiple target tracking, using $K$-valued scalar data association hypotheses rather than the usual binary indicator vectors. The same method is then applied to a little-known but much more general type of supervised EM algorithm for shared kernel models, related to probabilistic radial basis function networks. We address a number of shortcomings in the derivations that have been published previously in this area. In particular, we give theoretically rigorous derivations of (i) the complete data likelihood; (ii) Baum's auxiliary function (the E-step) and (iii) the maximisation (M-step) in the case of Gaussian shared kernel models. The subsequent algorithm, called shared kernel EM (SKEM), is then applied to a digit recognition problem using a novel 7-segment digit representation. Variants of the algorithm that use different numbers of features and different EM algorithm dimensions are compared in terms of mean accuracy and mean IoU. A simplified classifier is proposed that decomposes the joint data PDF as a product of lower order PDFs over non-overlapping subsets of variables. The effect of different numbers of assumed mixture components $K$ is also investigated. High-level source code for the data generation and SKEM algorithm is provided.
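    SKEM itself targets supervised Gaussian shared kernel models; as a compact reference for the E- and M-steps being re-derived, here is a vanilla EM sketch for a 1-D Gaussian mixture (the function names and toy data are ours):

        import numpy as np

        def em_gmm(x, K=2, iters=100, seed=0):
            """Vanilla EM for a 1-D Gaussian mixture."""
            rng = np.random.default_rng(seed)
            pi, mu, var = np.full(K, 1.0 / K), rng.choice(x, K), np.full(K, x.var())
            for _ in range(iters):
                # E-step: responsibilities (soft data-association hypotheses).
                dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
                resp = pi * dens
                resp /= resp.sum(1, keepdims=True)
                # M-step: weighted maximum-likelihood parameter updates.
                Nk = resp.sum(0)
                pi = Nk / len(x)
                mu = (resp * x[:, None]).sum(0) / Nk
                var = (resp * (x[:, None] - mu) ** 2).sum(0) / Nk
            return pi, mu, var

        rng = np.random.default_rng(1)
        x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
        print(em_gmm(x))  # recovers the two components' weights, means, variances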
    Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data. (arXiv:2205.09060v1 [cs.LG])
    Not all real-world data are labeled, and when labels are not available, it is often costly to obtain them. Moreover, as many algorithms suffer from the curse of dimensionality, reducing the features in the data to a smaller set is often of great utility. Unsupervised feature selection aims to reduce the number of features, often using feature importance scores to quantify the relevance of single features to the task at hand. These scores can be based only on the distribution of variables and the quantification of their interactions. The previous literature, mainly investigating anomaly detection and clustering, fails to address the redundancy-elimination issue. We propose an evaluation of correlations among features to compute feature importance scores representing the contribution of single features in explaining the dataset's structure. Based on Coalitional Game Theory, our feature importance scores include a notion of redundancy awareness, making them a tool for redundancy-free feature selection. We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate while maximizing the information contained in the data. We also introduce an approximated version of the algorithm to reduce the complexity of the Shapley value computations.
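    The snippet below sketches the Monte-Carlo Shapley estimate underlying such scores. The coalition value function here is a hypothetical stand-in (the paper derives its value function from correlations among features); the built-in redundancy between features 1 and 2 shows how redundancy awareness lowers their scores:

        import numpy as np

        def shapley_values(v, n_features, n_samples=500, seed=0):
            """Monte-Carlo Shapley estimate for a coalition value function v(S)."""
            rng = np.random.default_rng(seed)
            phi = np.zeros(n_features)
            for _ in range(n_samples):
                S = set()
                for f in rng.permutation(n_features):
                    phi[f] += v(S | {f}) - v(S)   # marginal contribution of f
                    S.add(f)
            return phi / n_samples

        def v(S):
            # Hypothetical value of a feature coalition; 1 and 2 are redundant.
            base = {0: 0.5, 1: 0.4, 2: 0.4, 3: 0.1}
            val = sum(base[f] for f in S)
            if 1 in S and 2 in S:
                val -= 0.35
            return val

        print(shapley_values(v, 4).round(3))  # features 1 and 2 share their credit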
    Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. (arXiv:2205.09029v1 [stat.ML])
    Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
    Moderately Supervised Learning: Definition, Framework and Generality. (arXiv:2008.11945v3 [cs.CV] UPDATED)
    Learning with supervision has achieved remarkable success in numerous artificial intelligence (AI) applications. In the current literature, by referring to the properties of the labels prepared for the training data set, learning with supervision is categorized as supervised learning (SL) and weakly supervised learning (WSL). SL concerns the situation where the training data set is assigned ideal labels, while WSL concerns the situation where the training data set is assigned non-ideal labels. However, without considering the properties of the transformation from the given labels to learnable targets, the definition of SL is relatively abstract, which conceals some details that can be critical to building the appropriate solutions for specific SL tasks. Thus, it is desirable to reveal these details more concretely. This article attempts to achieve this goal by expanding the categorization of SL and investigating the sub-type that plays the central role in SL. More specifically, taking into consideration the properties of the transformation from the given labels to learnable targets, we first categorize SL into three narrower sub-types. We then focus on the moderately supervised learning (MSL) sub-type, which concerns the situation where the given labels are ideal but, due to the simplicity in annotation, careful designs are required to transform the given labels into learnable targets. From the perspectives of the definition, framework and generality, we comprehensively illustrate MSL and reveal what details are concealed by the abstractness of the definition of SL. At the same time, the overall presentation of this paper also serves as a tutorial for AI application engineers on viewing a problem to be solved from a mathematician's perspective.
    FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning. (arXiv:2107.04271v5 [cs.DC] UPDATED)
    Applying Federated Learning (FL) on Internet-of-Things devices is necessitated by the large volumes of data they produce and growing concerns of data privacy. However, there are three challenges that need to be addressed to make FL efficient: (i) execution on devices with limited computational capabilities, (ii) accounting for stragglers due to computational heterogeneity of devices, and (iii) adaptation to the changing network bandwidths. This paper presents FedAdapt, an adaptive offloading FL framework to mitigate the aforementioned challenges. FedAdapt accelerates local training in computationally constrained devices by leveraging layer offloading of deep neural networks (DNNs) to servers. Further, FedAdapt adopts reinforcement learning based optimization and clustering to adaptively identify which layers of the DNN should be offloaded for each individual device on to a server to tackle the challenges of computational heterogeneity and changing network bandwidth. Experimental studies are carried out on a lab-based testbed and it is demonstrated that by offloading a DNN from the device to the server FedAdapt reduces the training time of a typical IoT device by over half compared to classic FL. The training time of extreme stragglers and the overall training time can be reduced by up to 57%. Furthermore, with changing network bandwidth, FedAdapt is demonstrated to reduce the training time by up to 40% when compared to classic FL, without sacrificing accuracy.
    SOUL: An Energy-Efficient Unsupervised Online Learning Seizure Detection Classifier. (arXiv:2110.02169v2 [eess.SP] UPDATED)
    Implantable devices that record neural activity and detect seizures have been adopted to issue warnings or trigger neurostimulation to suppress epileptic seizures. Typical seizure detection systems rely on high-accuracy offline-trained machine learning classifiers that require manual retraining when seizure patterns change over long periods of time. For an implantable seizure detection system, a low power, at-the-edge, online learning algorithm can be employed to dynamically adapt to the neural signal drifts, thereby maintaining high accuracy without external intervention. This work proposes SOUL: Stochastic-gradient-descent-based Online Unsupervised Logistic regression classifier. After an initial offline training phase, continuous online unsupervised classifier updates are applied in situ, which improves sensitivity in patients with drifting seizure features. SOUL was tested on two human electroencephalography (EEG) datasets: the CHB-MIT scalp EEG dataset, and a long (>100 hours) NeuroVista intracranial EEG dataset. It was able to achieve an average sensitivity of 97.5% and 97.9% for the two datasets respectively, at >95% specificity. Sensitivity improved by at most 8.2% on long-term data when compared to a typical seizure detection classifier. SOUL was fabricated in TSMC's 28 nm process occupying 0.1 mm^2 and achieves 1.5 nJ/classification energy efficiency, which is at least 24x more efficient than state-of-the-art.
    SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals. (arXiv:2004.09557v3 [cs.LG] UPDATED)
    Clinical settings are often characterized by abundant unlabelled data and limited labelled data. This is typically driven by the high burden placed on oracles (e.g., physicians) to provide annotations. One way to mitigate this burden is via active learning (AL) which involves the (a) acquisition and (b) annotation of informative unlabelled instances. Whereas previous work addresses either one of these elements independently, we propose an AL framework that addresses both. For acquisition, we propose Bayesian Active Learning by Consistency (BALC), a sub-framework which perturbs both instances and network parameters and quantifies changes in the network output probability distribution. For annotation, we propose SoQal, a sub-framework that dynamically determines whether, for each acquired unlabelled instance, to request a label from an oracle or to pseudo-label it instead. We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
    SoK: The Impact of Unlabelled Data in Cyberthreat Detection. (arXiv:2205.08944v1 [cs.CR])
    Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in recent years. A substantial research effort has been invested in the development of specialized algorithms for CTD tasks. From the operational perspective, however, the progress of ML-based CTD is hindered by the difficulty of obtaining the large sets of labelled data needed to train ML detectors. A potential solution to this problem is semi-supervised learning (SsL), which combines small labelled datasets with large amounts of unlabelled data. This paper aims to systematize existing work on SsL for CTD and, in particular, to understand the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for the evaluation of SsL methods, which elucidates the contribution of unlabelled data. We review the state-of-the-art and observe that no previous work meets these requirements. To address this problem, we propose a framework for assessing the benefits of unlabelled data in SsL. We showcase an application of this framework by performing the first benchmark evaluation that highlights the trade-offs of 9 existing SsL methods on 9 public datasets. Our findings verify that, in some cases, unlabelled data provides a small, but statistically significant, performance gain. This paper highlights that SsL in CTD has a lot of room for improvement, which should stimulate future research in this field.
    Entity Alignment with Reliable Path Reasoning and Relation-Aware Heterogeneous Graph Transformer. (arXiv:2205.08806v1 [cs.CL])
    Entity Alignment (EA) has attracted widespread attention in both academia and industry; it aims to find entities with the same meaning across different Knowledge Graphs (KGs). There are abundant multi-step relation paths between entities in KGs, indicating the semantic relations of entities. However, existing methods rarely consider path information because not all natural paths are conducive to the EA judgment. In this paper, we propose a more effective entity alignment framework, RPR-RHGT, which integrates relation and path structure information, as well as the heterogeneous information in KGs. In particular, a reliable path reasoning algorithm is developed to generate, from the relation structures of KGs, the paths favorable for the EA task; to our knowledge, this is the first algorithm in the literature to successfully use unrestricted path information. In addition, to efficiently capture heterogeneous features in entity neighborhoods, a relation-aware heterogeneous graph transformer is designed to model the relation and path structures of KGs. Extensive experiments on three well-known datasets show that RPR-RHGT significantly outperforms 11 state-of-the-art methods, exceeding the best-performing baseline by up to 8.62% on Hits@1. We also show its better performance than the baselines on different training-set ratios and on harder datasets.
    Automating In-Network Machine Learning. (arXiv:2205.08824v1 [cs.NI])
    Using programmable network devices to aid in-network machine learning has been the focus of significant research. However, most of the research was of a limited scope, providing a proof of concept or describing a closed-source algorithm. To date, no general solution has been provided for mapping machine learning algorithms to programmable network devices. In this paper, we present Planter, an open-source, modular framework for mapping trained machine learning models to programmable devices. Planter supports a wide range of machine learning models, multiple targets and can be easily extended. The evaluation of Planter compares different mapping approaches, and demonstrates the feasibility, performance, and resource efficiency for applications such as anomaly detection, financial transactions, and quality of experience. The results show that Planter-based in-network machine learning algorithms can run at line rate, have a negligible effect on latency, coexist with standard switching functionality, and have no or minor accuracy trade-offs.
    DL4DS -- Deep Learning for empirical DownScaling. (arXiv:2205.08967v1 [cs.LG])
    A common task in Earth Sciences is to infer climate information at local and regional scales from global climate models. Dynamical downscaling requires running expensive numerical models at high resolution, which can be prohibitive due to long model runtimes. On the other hand, statistical downscaling techniques present an alternative approach for learning links between the large- and local-scale climate in a more efficient way. A large number of deep neural network-based approaches for statistical downscaling have been proposed in recent years, mostly based on convolutional architectures developed for computer vision and super-resolution tasks. This paper presents DL4DS, Deep Learning for empirical DownScaling, a python library that implements a wide variety of state-of-the-art and novel algorithms for downscaling gridded Earth Science data with deep neural networks. DL4DS has been designed with the goal of providing a general framework for training convolutional neural networks with configurable architectures and learning strategies, to facilitate conducting comparative and ablation studies in a robust way. We showcase the capabilities of DL4DS on air quality CAMS data over the western Mediterranean area. The DL4DS library can be found in this repository: https://github.com/carlos-gg/dl4ds
    One Explanation to Rule them All -- Ensemble Consistent Explanations. (arXiv:2205.08974v1 [cs.AI])
    Transparency is a major requirement of modern AI-based decision-making systems deployed in the real world. A popular approach for achieving transparency is by means of explanations, and a wide variety of explanations have been proposed for single decision-making systems. In practice, however, a set (i.e., an ensemble) of decisions is often used instead of a single decision, in particular in complex systems. Unfortunately, explanation methods for single decision-making systems are not easily applicable to ensembles: they would yield an ensemble of individual explanations that are not necessarily consistent, hence less useful and more difficult to understand than a single consistent explanation of all observed phenomena. We propose a novel concept for consistently explaining an ensemble of decisions locally with a single explanation: we introduce a formal concept, as well as a specific implementation using counterfactual explanations.
    Price Interpretability of Prediction Markets: A Convergence Analysis. (arXiv:2205.08913v1 [q-fin.TR])
    Prediction markets have long been known for their prediction accuracy. However, there is still a lack of systematic understanding of how prediction markets aggregate information and why they work so well. This work proposes a multivariate utility (MU)-based mechanism that unifies several existing prediction market-making schemes. Based on this mechanism, we derive convergence results for markets with myopic, risk-averse traders who repeatedly interact with the market maker. We show that the resulting limiting wealth distribution lies on the Pareto efficient frontier defined by all market participants' utilities. With the help of this result, we establish both analytical and numerical results for the limiting price for different market models. We show that the limiting price converges to the geometric mean of agents' beliefs for exponential utility-based markets. For risk measure-based markets, we construct a risk measure family that meets the convergence requirements and show that the limiting price can converge to a weighted power mean of agent beliefs. For markets based on hyperbolic absolute risk aversion (HARA) utilities, we show that the limiting price is also a risk-adjusted weighted power mean of agent beliefs, even though the trading order will affect the aggregation weights. We further propose an approximation scheme for the limiting price under the HARA utility family. We show through numerical experiments that our approximation scheme works well in predicting the convergent prices.
    POViT: Vision Transformer for Multi-objective Design and Characterization of Nanophotonic Devices. (arXiv:2205.09045v1 [cs.LG])
    We solve a fundamental challenge in semiconductor IC design: the fast and accurate characterization of nanoscale photonic devices. Much like the fusion between AI and EDA, many efforts have been made to apply DNNs such as convolutional neural networks (CNNs) to prototype and characterize next-gen optoelectronic devices commonly found in photonic integrated circuits (PICs) and LiDAR. These prior works generally strive to predict the quality factor (Q) and modal volume (V) of, for instance, photonic crystals with ultra-high accuracy and speed. However, state-of-the-art models are still far from being directly applicable in the real world: e.g., the correlation coefficient of V ($V_{coeff}$) is only about 80%, which is much lower than what it takes to generate reliable and reproducible nanophotonic designs. Recently, attention-based transformer models have attracted extensive interest and been widely used in CV and NLP. In this work, we propose the first-ever Transformer model (POViT) to efficiently design and simulate semiconductor photonic devices with multiple objectives. Unlike the standard Vision Transformer (ViT), we supplied photonic crystals as data input and changed the activation layer from GELU to an absolute-value function (ABS). Our experiments show that POViT significantly exceeds the results reported by previous models. The correlation coefficient $V_{coeff}$ increases by over 12% (i.e., to 92.0%) and the prediction error of Q is reduced by an order of magnitude, among several other key metric improvements. Our work has the potential to drive the expansion of EDA to fully automated photonic design. The complete dataset and code will be released to aid researchers endeavoring in the interdisciplinary field of physics and computer science.
    Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs. (arXiv:2205.09056v1 [cs.LG])
    Reinforcement learning (RL) generalizes bandit problems, with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, \textbf{any} slowly changing adversarial bandit algorithm that enjoys near-optimal regret in adversarial bandits can achieve near-optimal (expected) regret in non-episodic discounted MDPs. The slowly changing property required by our generalization is mild; see e.g. (Even-Dar et al. 2009, Neu et al. 2010). We also show, for example, that Exp3 (Auer et al. 2002) is slowly changing and enjoys near-optimal regret in MDPs.
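    For reference, here is a minimal Python sketch of Exp3 (Auer et al. 2002), whose importance-weighted exponential updates move the sampling distribution only slightly per round, the kind of slowly changing behavior the result relies on (the toy reward function is ours):

        import numpy as np

        def exp3(reward_fn, K=3, T=1000, gamma=0.1, seed=0):
            """Exp3 for adversarial bandits; rewards assumed in [0, 1]."""
            rng = np.random.default_rng(seed)
            w = np.ones(K)
            for t in range(T):
                p = (1 - gamma) * w / w.sum() + gamma / K
                a = rng.choice(K, p=p)
                r = reward_fn(a, t)
                w[a] *= np.exp(gamma * (r / p[a]) / K)  # importance-weighted update
                w /= w.max()                            # rescale for stability
            return (1 - gamma) * w / w.sum() + gamma / K

        # Hypothetical environment: arm 1 pays the most.
        print(exp3(lambda a, t: (0.3, 0.8, 0.5)[a]).round(3))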
    Multilayer Perceptron Based Stress Evolution Analysis under DC Current Stressing for Multi-segment Wires. (arXiv:2205.09065v1 [cs.LG])
    Electromigration (EM) is one of the major concerns in the reliability analysis of very large scale integration (VLSI) systems due to continuous technology scaling. Accurately predicting the time-to-failure of integrated circuits (ICs) is becoming increasingly important for modern IC design. However, traditional methods are often not sufficiently accurate, leading to undesirable over-design, especially in advanced technology nodes. In this paper, we propose an approach using multilayer perceptrons (MLPs) to compute stress evolution in interconnect trees during the void nucleation phase. The availability of a customized trial function for neural network training holds the promise of finding dynamic mesh-free stress evolution on complex interconnect trees under time-varying temperatures. Specifically, we formulate a new objective function considering the EM-induced coupled partial differential equations (PDEs), boundary conditions (BCs), and initial conditions to enforce the physics-based constraints in the spatial-temporal domain. The proposed model avoids meshing and reduces temporal iterations compared with conventional numerical approaches like FEM. Numerical results confirm its advantages in accuracy and computational performance.
    Multi-disciplinary fairness considerations in machine learning for clinical trials. (arXiv:2205.08875v1 [cs.LG])
    While interest in the application of machine learning to improve healthcare has grown tremendously in recent years, a number of barriers prevent deployment in medical practice. A notable concern is the potential to exacerbate entrenched biases and existing health disparities in society. The area of fairness in machine learning seeks to address these issues of equity; however, appropriate approaches are context-dependent, necessitating domain-specific consideration. We focus on clinical trials, i.e., research studies conducted on humans to evaluate medical treatments. Clinical trials are a relatively under-explored application in machine learning for healthcare, in part due to complex ethical, legal, and regulatory requirements and high costs. Our aim is to provide a multi-disciplinary assessment of how fairness for machine learning fits into the context of clinical trials research and practice. We start by reviewing the current ethical considerations and guidelines for clinical trials and examine their relationship with common definitions of fairness in machine learning. We examine potential sources of unfairness in clinical trials, providing concrete examples, and discuss the role machine learning might play in either mitigating potential biases or exacerbating them when applied without care. Particular focus is given to adaptive clinical trials, which may employ machine learning. Finally, we highlight concepts that require further investigation and development, and emphasize new approaches to fairness that may be relevant to the design of clinical trials.
    Deep Features for CBIR with Scarce Data using Hebbian Learning. (arXiv:2205.08935v1 [cs.CV])
    Features extracted from Deep Neural Networks (DNNs) have proven to be very effective in the context of Content Based Image Retrieval (CBIR). In recent work, biologically inspired \textit{Hebbian} learning algorithms have shown promise for DNN training. In this contribution, we study the performance of such algorithms in the development of feature extractors for CBIR tasks. Specifically, we consider a semi-supervised learning strategy in two steps: first, an unsupervised pre-training stage is performed using Hebbian learning on the image dataset; second, the network is fine-tuned using supervised Stochastic Gradient Descent (SGD) training. For the unsupervised pre-training stage, we explore the nonlinear Hebbian Principal Component Analysis (HPCA) learning rule. For the supervised fine-tuning stage, we assume sample-efficiency scenarios, in which the amount of labeled samples is just a small fraction of the whole dataset. Our experimental analysis, conducted on the CIFAR10 and CIFAR100 datasets, shows that, when few labeled samples are available, our Hebbian approach provides relevant improvements compared to various alternative methods.
    One-way Explainability Isn't The Message. (arXiv:2205.08954v1 [cs.LG])
    Recent engineering developments in specialised computational hardware, data-acquisition and storage technology have seen the emergence of Machine Learning (ML) as a powerful form of data analysis with widespread applicability beyond its historical roots in the design of autonomous agents. However -- possibly because of its origins in the development of agents capable of self-discovery -- relatively little attention has been paid to the interaction between people and ML. In this paper we are concerned with the use of ML in automated or semi-automated tools that assist one or more human decision makers. We argue that the requirements on both human and machine in this context are significantly different from the use of ML either as part of autonomous agents for self-discovery or as part of statistical data analysis. Our principal position is that the design of such human-machine systems should be driven by repeated, two-way intelligibility of information rather than one-way explainability of the ML system's recommendations. Iterated rounds of intelligible information exchange, we think, will characterise the kinds of collaboration that will be needed to understand complex phenomena for which neither human nor machine have complete answers. We propose operational principles -- we call them Intelligibility Axioms -- to guide the design of a collaborative decision-support system. The principles are concerned with: (a) what it means for information provided by the human to be intelligible to the ML system; and (b) what it means for an explanation provided by an ML system to be intelligible to a human. Using examples from the literature on the use of ML for drug-design and in medicine, we demonstrate cases where the conditions of the axioms are met. We describe some additional requirements needed for the design of a truly collaborative decision-support system.
    Predicting failure characteristics of structural materials via deep learning based on nondestructive void topology. (arXiv:2205.09075v1 [cond-mat.mtrl-sci])
    Accurate prediction of the failure progression of structural materials is critical for preventing failure-induced accidents. Despite considerable mechanics modeling-based efforts, accurate prediction remains a challenging task in real-world environments due to unexpected damage factors and defect evolutions. Here, we report a novel method for predicting material failure characteristics that uniquely combines nondestructive X-ray computed tomography (X-CT), persistent homology (PH), and deep multimodal learning (DML). The combined method exploits the microstructural defect state at the time of material examination as an input, and outputs the failure-related properties. Our method is demonstrated to be effective using two types of fracture datasets (tensile and fatigue datasets) with ferritic low alloy steel as a representative structural material. The method achieves a mean absolute error (MAE) of 0.09 in predicting the local strain with the tensile dataset and an MAE of 0.14 in predicting the fracture progress with the fatigue dataset. These high accuracies are mainly due to PH processing of the X-CT images, which transforms complex and noisy three-dimensional X-CT images into compact two-dimensional persistence diagrams that preserve key topological features such as the internal void size, density, and distribution. The combined PH and DML processing of 3D X-CT data is our unique approach enabling reliable failure predictions at the time of material examination based on void topology progressions, and the method can be extended to various nondestructive failure tests for practical use.
    On-device modeling of user's social context and familiar places from smartphone-embedded sensor data. (arXiv:2205.08790v1 [cs.LG])
    Context modeling and recognition are complex tasks that allow mobile and ubiquitous computing applications to adapt to the user's situation. Current solutions mainly focus on limited context information, generally processed on centralized architectures, potentially exposing users' personal data to privacy leakage and missing personalization features. For these reasons, on-device context modeling and recognition represent the current research trend in this area. Among the different kinds of information characterizing the user's context in mobile environments, social interactions and visited locations contribute significantly to the characterization of daily life scenarios. In this paper we propose a novel, unsupervised and lightweight approach to model the user's social context and her locations based on ego networks, directly on the user's mobile device. Relying on this model, the system is able to extract high-level and semantic-rich context features from smartphone-embedded sensor data. Specifically, for the social context it exploits data related to both physical and cyber social interactions among users and their devices. As far as the location context is concerned, we assume that modeling the familiarity degree of a specific location is more relevant for the user's context than the raw location data, both in terms of GPS coordinates and proximity devices. Using 5 real-world datasets, we assess the structure of the social and location ego networks, provide a semantic evaluation of the proposed models, and evaluate their complexity in terms of mobile computing performance. Finally, we demonstrate the relevance of the extracted features by showing the performance of 3 machine learning algorithms in recognizing daily-life situations, obtaining an improvement of 3% in AUROC, 9% in Precision, and 5% in Recall with respect to using only features related to the physical context.
    Policy Distillation with Selective Input Gradient Regularization for Efficient Interpretability. (arXiv:2205.08685v1 [cs.LG])
    Although deep Reinforcement Learning (RL) has proven successful in a wide range of tasks, one challenge it faces is interpretability when applied to real-world problems. Saliency maps are frequently used to provide interpretability for deep neural networks. However, in the RL domain, existing saliency map approaches are either computationally expensive and thus cannot satisfy the real-time requirement of real-world scenarios or cannot produce interpretable saliency maps for RL policies. In this work, we propose an approach of Distillation with selective Input Gradient Regularization (DIGR) which uses policy distillation and input gradient regularization to produce new policies that achieve both high interpretability and computation efficiency in generating saliency maps. Our approach is also found to improve the robustness of RL policies to multiple adversarial attacks. We conduct experiments on three tasks, MiniGrid (Fetch Object), Atari (Breakout) and CARLA Autonomous Driving, to demonstrate the importance and effectiveness of our approach.
    Optimal Adaptive Prediction Intervals for Electricity Load Forecasting in Distribution Systems via Reinforcement Learning. (arXiv:2205.08698v1 [stat.AP])
    Prediction intervals (PIs) offer an effective tool for quantifying the uncertainty of loads in distribution systems. Traditional central PIs cannot adapt well to skewed distributions, and their offline training makes them vulnerable to unforeseen changes in future load patterns. Therefore, we propose an optimal PI estimation approach that is online and adapts to different data distributions by adaptively determining symmetric or asymmetric probability proportion pairs for quantiles. It relies on the online learning ability of reinforcement learning to integrate the two online tasks, i.e., the adaptive selection of probability proportion pairs and quantile prediction, both of which are modeled by neural networks. As such, the quality of the quantile-formed PIs can guide the selection of optimal probability proportion pairs, forming a closed loop that improves PI quality. Furthermore, to improve the learning efficiency of the quantile forecasts, a prioritized experience replay strategy is proposed for the online quantile regression processes. Case studies on both load and net load demonstrate that the proposed method adapts better to the data distribution than the online central PI method. Compared with offline-trained methods, it obtains PIs of better quality and is more robust against concept drift.
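    As a small illustration of why asymmetric probability proportion pairs help on skewed loads (the lognormal toy data and the particular pair are our assumptions), together with the pinball loss commonly used to score quantile forecasts:

        import numpy as np

        def pinball_loss(y, q_pred, tau):
            """Pinball (quantile) loss used to train and score quantile forecasts."""
            diff = y - q_pred
            return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

        y = np.random.default_rng(0).lognormal(size=5000)   # right-skewed "load"
        central = np.quantile(y, 0.95) - np.quantile(y, 0.05)
        asym = np.quantile(y, 0.92) - np.quantile(y, 0.02)  # same 90% coverage
        print(central, asym)  # the asymmetric pair yields a narrower interval
        print(pinball_loss(y, np.quantile(y, 0.92), 0.92))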
    Customizing ML Predictions for Online Algorithms. (arXiv:2205.08715v1 [cs.LG])
    A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a black-box, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the context of the classic rent-or-buy problem, and show that incorporating optimization benchmarks in ML loss functions leads to significantly better performance, while maintaining a worst-case adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations.
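    The paper's focus is the loss-function design; as a sketch of what "incorporating an optimization benchmark into the loss" can mean, the snippet below scores a prediction by the competitive ratio it induces when fed to a standard prediction-aware rent-or-buy rule (the rule and the trade-off parameter lam follow Purohit et al. 2018 and are our assumption, not necessarily this paper's exact construction):

        def ski_cost(buy_day, true_days, buy_cost):
            """Cost paid when renting at 1/day and buying on `buy_day`."""
            return true_days if true_days < buy_day else (buy_day - 1) + buy_cost

        def benchmark_loss(pred_days, true_days, buy_cost, lam=0.5):
            """Loss of a prediction = cost of the online rule run with it,
            normalized by the offline optimum (a competitive-ratio surrogate)."""
            if pred_days >= buy_cost:              # advice says "long": buy early
                buy_day = int(lam * buy_cost) + 1
            else:                                  # advice says "short": buy late
                buy_day = int(buy_cost / lam) + 1
            return ski_cost(buy_day, true_days, buy_cost) / min(true_days, buy_cost)

        print(benchmark_loss(pred_days=40, true_days=35, buy_cost=10))  # good advice
        print(benchmark_loss(pred_days=3, true_days=35, buy_cost=10))   # bad advice, still <= 1 + 1/lam

    Training the predictor against such a loss, rather than squared error, is one way to tie the ML model directly to the downstream algorithm's performance.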
    No More Pesky Hyperparameters: Offline Hyperparameter Tuning for RL. (arXiv:2205.08716v1 [cs.LG])
    The performance of reinforcement learning (RL) agents is sensitive to the choice of hyperparameters. In real-world settings like robotics or industrial control systems, however, testing different hyperparameter configurations directly on the environment can be financially prohibitive, dangerous, or time consuming. We propose a new approach to tune hyperparameters from offline logs of data, to fully specify the hyperparameters for an RL agent that learns online in the real world. The approach is conceptually simple: we first learn a model of the environment from the offline data, which we call a calibration model, and then simulate learning in the calibration model to identify promising hyperparameters. We identify several criteria to make this strategy effective, and develop an approach that satisfies these criteria. We empirically investigate the method in a variety of settings to identify when it is effective and when it fails.
    Hyperparameter Optimization with Neural Network Pruning. (arXiv:2205.08695v1 [cs.CV])
    Since deep learning models are highly dependent on hyperparameters, hyperparameter optimization is essential in developing deep learning model-based applications, even if it takes a long time. As service development using deep learning models has gradually become competitive, many developers highly demand rapid hyperparameter optimization algorithms, and researchers are focusing on improving their speed. However, the huge time consumption of hyperparameter optimization, due to the high computational cost of the deep learning model itself, has not been dealt with in depth. To address this, analogous to the use of surrogate models in Bayesian optimization, it is necessary to consider a proxy model for the neural network (N_B) used for hyperparameter optimization. Inspired by the main goal of neural network pruning, i.e., high computational-cost reduction with performance preservation, we presumed that the neural network (N_P) obtained through neural network pruning would be a good proxy model of N_B. In order to verify our idea, we performed extensive experiments using the CIFAR10, CIFAR100, and TinyImageNet datasets, three commonly used neural networks, and three representative hyperparameter optimization methods. Through these experiments, we verified that N_P can be a good proxy model of N_B for rapid hyperparameter optimization. The proposed hyperparameter optimization framework can reduce the amount of time required by up to 37%.
    Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling. (arXiv:2205.08766v1 [cs.LG])
    Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.
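    To make the marginal/joint distinction concrete, here is a small sketch computing both cross-entropies from Monte-Carlo posterior samples (the array shapes and the Dirichlet toy predictive are our assumptions):

        import numpy as np

        def marginal_and_joint_ce(probs, labels):
            """probs: [S, N, C] predictive probabilities from S posterior samples.
            Marginal CE averages the predictive per point; joint CE scores the
            whole batch under the posterior-averaged joint predictive."""
            S, N, C = probs.shape
            marg = probs.mean(0)                                      # [N, C]
            marginal_ce = -np.log(marg[np.arange(N), labels]).mean()
            log_lik = np.log(probs[:, np.arange(N), labels]).sum(1)   # [S]
            joint_ce = -(np.logaddexp.reduce(log_lik) - np.log(S)) / N
            return marginal_ce, joint_ce

        rng = np.random.default_rng(0)
        p = rng.dirichlet(np.ones(3), size=(10, 5))                   # S=10, N=5, C=3
        print(marginal_and_joint_ce(p, np.array([0, 1, 2, 0, 1])))

    The joint cross-entropy penalizes predictives whose errors are correlated across the batch, which the marginal score cannot see.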
    Spatial-Temporal Interactive Dynamic Graph Convolution Network for Traffic Forecasting. (arXiv:2205.08689v1 [cs.LG])
    Accurate traffic forecasting is essential for smart cities to achieve traffic flow control, route planning, and detection. Although many spatial-temporal methods have been proposed, they are deficient in capturing the spatial-temporal dependence of traffic data synchronously. In addition, most methods ignore the dynamically changing correlations between road network nodes that arise as traffic data change. To address the above challenges, we propose a neural network-based Spatial-Temporal Interactive Dynamic Graph Convolutional Network (STIDGCN) for traffic forecasting. In STIDGCN, we propose an interactive dynamic graph convolution structure, which first divides the sequences at intervals and captures the spatial-temporal dependence of the traffic data simultaneously through an interactive learning strategy for effective long-term prediction. We also propose a novel dynamic graph convolution module consisting of a graph generator and a fusion graph convolution. This module can use the input traffic data and a pre-defined graph structure to generate a graph structure and fuse it with a defined adaptive adjacency matrix, which fills in the pre-defined graph structure and simulates the generation of dynamic associations between nodes in the road network. Extensive experiments on four real-world traffic flow datasets demonstrate that STIDGCN outperforms the state-of-the-art baselines.
    Deep-learned orthogonal basis patterns for fast, noise-robust single-pixel imaging. (arXiv:2205.08736v1 [eess.IV])
    Single-pixel imaging (SPI) is a novel, unconventional method that goes beyond the notion of traditional cameras but can be computationally expensive and slow for real-time applications. Deep learning has been proposed as an alternative approach for solving the SPI reconstruction problem, but a detailed analysis of its performance and generated basis patterns when used for SPI is limited. We present a modified deep convolutional autoencoder network (DCAN) for SPI on 64x64 pixel images with up to 6.25% compression ratio and apply binary and orthogonality regularizers during training. Training a DCAN with these regularizers allows it to learn multiple measurement bases that have combinations of binary or non-binary, and orthogonal or non-orthogonal patterns. We compare the reconstruction quality, orthogonality of the patterns, and robustness to noise of the resulting DCAN models to traditional SPI reconstruction algorithms (such as Total Variation minimization and Fourier Transform). Our DCAN models can be trained to be robust to noise while still having fast enough reconstruction times (~3 ms per frame) to be viable for real-time imaging.
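    The exact regularizer forms are not given in the abstract; as a hedged sketch of what binary and orthogonality penalties on learned measurement patterns might look like (both forms are our assumptions):

        import numpy as np

        def binary_reg(W):
            """Penalty pushing pattern weights towards {0, 1} (assumed form)."""
            return np.mean((W * (1.0 - W)) ** 2)

        def ortho_reg(W):
            """Penalty pushing the flattened patterns towards orthogonality."""
            norms = np.linalg.norm(W, axis=1)
            G = (W @ W.T) / np.outer(norms, norms)   # cosine Gram matrix
            return np.mean((G - np.eye(len(W))) ** 2)

        W = np.random.default_rng(0).random((64, 32 * 32))  # 64 learned patterns
        print(binary_reg(W), ortho_reg(W))  # add both, weighted, to the recon loss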
    Revisiting PINNs: Generative Adversarial Physics-informed Neural Networks and Point-weighting Method. (arXiv:2205.08754v1 [cs.LG])
    Physics-informed neural networks (PINNs) provide a deep learning framework for numerically solving partial differential equations (PDEs) and have been widely used in a variety of PDE problems. However, some challenges remain in the application of PINNs: 1) the mechanism of PINNs is unsuitable for (or at least cannot directly exploit) a small set of (usually very few) extra informative samples to refine the network; and 2) training PINNs is often inefficient for some complicated PDEs. In this paper, we propose the generative adversarial physics-informed neural network (GA-PINN), which integrates the generative adversarial (GA) mechanism with the structure of PINNs to improve the performance of PINNs by exploiting only a small number of exact solutions to the PDEs. Inspired by the weighting strategy of the AdaBoost method, we then introduce a point-weighting (PW) method to improve the training efficiency of PINNs, where the weight of each sample point is adaptively updated at each training iteration. The numerical experiments show that GA-PINNs outperform PINNs on many well-known PDEs and that the PW method also improves the efficiency of training PINNs and GA-PINNs.
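    The abstract does not spell out the PW update; an AdaBoost-flavoured re-weighting of collocation points by PDE residual might look like the following sketch (the exponential form and the step size eta are our assumptions):

        import numpy as np

        def update_point_weights(residuals, weights, eta=1.0):
            """AdaBoost-inspired re-weighting (sketch): collocation points with
            large PDE residual receive more weight at the next iteration."""
            scale = np.abs(residuals) / (np.abs(residuals).max() + 1e-12)
            w = weights * np.exp(eta * scale)
            return w / w.sum()

        r = np.array([0.01, 0.5, 0.02, 1.2])   # hypothetical PDE residuals
        w = np.full(4, 0.25)
        for _ in range(3):
            w = update_point_weights(r, w)
        print(w.round(3))                      # mass shifts to the hard points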
    CARNet: A Dynamic Autoencoder for Learning Latent Dynamics in Autonomous Driving Tasks. (arXiv:2205.08712v1 [cs.LG])
    Autonomous driving has received a lot of attention in the automotive industry and is often seen as the future of transportation. Passenger vehicles equipped with a wide array of sensors (e.g., cameras, front-facing radars, LiDARs, and IMUs) capable of continuous perception of the environment are becoming increasingly prevalent. These sensors provide a stream of high-dimensional, temporally correlated data that is essential for reliable autonomous driving. An autonomous driving system should effectively use the information collected from the various sensors in order to form an abstract description of the world and maintain situational awareness. Deep learning models, such as autoencoders, can be used for that purpose, as they can learn compact latent representations from a stream of incoming data. However, most autoencoder models process the data independently, without assuming any temporal interdependencies. Thus, there is a need for deep learning models that explicitly consider the temporal dependence of the data in their architecture. This work proposes CARNet, a Combined dynAmic autoencodeR NETwork architecture that utilizes an autoencoder combined with a recurrent neural network to learn the current latent representation and, in addition, also predict future latent representations in the context of autonomous driving. We demonstrate the efficacy of the proposed model in both imitation and reinforcement learning settings using both simulated and real datasets. Our results show that the proposed model outperforms the baseline state-of-the-art model, while having significantly fewer trainable parameters.
    QAPPA: Quantization-Aware Power, Performance, and Area Modeling of DNN Accelerators. (arXiv:2205.08648v1 [cs.AR])
    As the machine learning and systems community strives to achieve higher energy-efficiency through custom DNN accelerators and model compression techniques, there is a need for a design space exploration framework that incorporates quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QAPPA, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN workloads. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our proposed lightweight processing elements achieve up to 4.9x more performance per area and energy improvement when compared to an INT16-based implementation.
    Evaluation of Transfer Learning for Polish with a Text-to-Text Model. (arXiv:2205.08808v1 [cs.CL])
    We introduce a new benchmark for assessing the quality of text-to-text models for Polish. The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering. In particular, since summarization and question answering lack benchmark datasets for the Polish language, we describe their construction and make them publicly available. Additionally, we present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective. Unsupervised denoising pre-training is performed efficiently by initializing the model weights with a multi-lingual T5 (mT5) counterpart. plT5 scores highest on all of these tasks except summarization, where plBART is best. In general (except for summarization), the larger the model, the better the results. We evaluate the performance of plT5, mT5, Polish BART (plBART), and Polish GPT-2 (papuGaPT2). The encoder-decoder architectures prove to be better than the decoder-only equivalent.  ( 2 min )
    Probability trees and the value of a single intervention. (arXiv:2205.08779v1 [cs.LG])
    The most fundamental problem in statistical causality is determining causal relationships from limited data. Probability trees, which combine prior causal structures with Bayesian updates, have been suggested as a possible solution. In this work, we quantify the information gain from a single intervention and show that both the anticipated information gain, prior to making an intervention, and the expected gain from an intervention have simple expressions. This results in an active-learning method that simply selects the intervention with the highest anticipated gain, which we illustrate through several examples. Our work demonstrates how probability trees, and Bayesian estimation of their parameters, offer a simple yet viable approach to fast causal induction.  ( 2 min )
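    To make the selection rule concrete, the toy sketch below scores each candidate intervention by expected entropy reduction and greedily picks the best; all numbers are made up, and in the paper the gains come from closed-form expressions over the tree's Bayesian posterior.

```python
import math

# Greedy active-learning rule: among candidate interventions, pick the one with
# the largest anticipated information gain (entropy of the prior minus expected
# entropy of the posterior over hypotheses, averaged over intervention outcomes).
def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_gain(prior, outcomes):
    # outcomes: list of (posterior over hypotheses, probability of that outcome)
    return entropy(prior) - sum(pr * entropy(post) for post, pr in outcomes)

candidates = {
    "do(X=1)": expected_gain([0.5, 0.5], [([0.9, 0.1], 0.5), ([0.2, 0.8], 0.5)]),
    "do(Y=1)": expected_gain([0.5, 0.5], [([0.6, 0.4], 0.5), ([0.4, 0.6], 0.5)]),
}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 3))
```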
    Markov Chain Monte Carlo for Continuous-Time Switching Dynamical Systems. (arXiv:2205.08803v1 [cs.LG])
    Switching dynamical systems are an expressive model class for the analysis of time-series data. Since the systems under study in many fields of the natural and engineering sciences typically evolve continuously in time, it is natural to consider continuous-time model formulations consisting of switching stochastic differential equations governed by an underlying Markov jump process. Inference in these types of models is, however, notoriously difficult, and tractable computational schemes are rare. In this work, we propose a novel inference algorithm utilizing a Markov Chain Monte Carlo approach. The presented Gibbs sampler allows us to efficiently obtain samples from the exact continuous-time posterior processes. Our framework naturally enables Bayesian parameter estimation, and we also include an estimate for the diffusion covariance, which is oftentimes assumed fixed in stochastic differential equation models. We evaluate our framework under the modeling assumption and compare it against an existing variational inference approach.  ( 2 min )
    Accurate Fairness: Improving Individual Fairness without Trading Accuracy. (arXiv:2205.08704v1 [cs.LG])
    Accuracy and fairness are both crucial aspects of trustworthy machine learning. However, in practice, enhancing one aspect may inevitably sacrifice the other. We propose in this paper a new fairness criterion, accurate fairness, to assess whether an individual is treated both accurately and fairly regardless of protected attributes. We further propose new fairness metrics, fair-precision, fair-recall, and fair-F1 score, to evaluate the reliability of a machine learning model from the perspective of accurate fairness. Thus, the side effects of enhancing just one of the two aspects, i.e., true bias and false fairness, can be effectively identified with our criterion. We then present a fair Siamese approach for accurate fairness training. To the best of our knowledge, this is the first time that a Siamese approach has been adapted for bias mitigation. Case studies with typical fairness benchmarks demonstrate that our fair Siamese approach can, on average, improve individual fairness by 17.4%, fair-F1 score by 11.5%, and accuracy by 4.7% over state-of-the-art bias mitigation techniques. Finally, our approach is applied to mitigate possible service discrimination with a real Ctrip dataset, by fairly serving on average 97.9% of customers with different consumption habits who pay the same prices for the same rooms (20.7% more than the original models).  ( 2 min )
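    One plausible reading of these metrics is sketched below: a positive prediction counts toward fair-precision and fair-recall only if it is both correct and unchanged under a flip of the protected attribute. The paper's formal definitions may differ; treat this purely as intuition.

```python
# Hedged sketch of a fairness-qualified F1: a prediction is a "fair true
# positive" only when it is correct AND identical for the counterfactual
# individual (protected attribute flipped, all else equal).
def fair_f1(y_true, y_pred, y_pred_counterfactual):
    tp = fp = fn = 0
    for t, p, pc in zip(y_true, y_pred, y_pred_counterfactual):
        fair = (p == pc)                  # treated the same regardless of attribute
        if p == 1 and t == 1 and fair:
            tp += 1
        elif p == 1:
            fp += 1                       # wrong or unfairly-given positive
        elif t == 1:
            fn += 1                       # missed positive
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(fair_f1([1, 1, 0, 1], [1, 1, 1, 0], [1, 0, 1, 0]))  # -> 0.4
```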
    The Solvability of Interpretability Evaluation Metrics. (arXiv:2205.08696v1 [cs.LG])
    Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency, which are motivated by the principle that more important features -- as judged by the explanation -- should have larger impacts on model prediction. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric and solve it using beam search. This brings up the obvious question: given such solvability, why do we still develop other explainers and then evaluate them on the metric? We present a series of investigations showing that this beam search explainer is generally comparable or favorable to current choices such as LIME and SHAP, suggest rethinking the goals of model interpretability, and identify several directions towards better evaluations of new method proposals.  ( 2 min )
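    The solvability observation can be made concrete with a toy beam search that optimizes an explanation directly for a metric; the scoring function below is a stand-in for comprehensiveness, which in practice measures the drop in model confidence when the selected features are removed.

```python
# Beam-search "explainer": grow feature subsets, keep the beam_width best under
# the metric, return the best subset of size k. score_fn is an assumed stand-in
# for a real metric such as comprehensiveness.
def beam_search_explanation(n_features, score_fn, k=5, beam_width=3):
    beams = [()]
    for _ in range(k):                    # grow explanations one feature at a time
        cand = {tuple(sorted(b + (f,))) for b in beams
                for f in range(n_features) if f not in b}
        beams = sorted(cand, key=score_fn, reverse=True)[:beam_width]
    return beams[0]

toy = lambda s: sum(1.0 if f in (2, 7) else 0.1 for f in s)  # features 2, 7 matter
print(beam_search_explanation(10, toy))   # -> a subset containing 2 and 7
```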
    A Regression Approach to Learning-Augmented Online Algorithms. (arXiv:2205.08717v1 [cs.LG])
    The emerging field of learning-augmented online algorithms uses ML techniques to predict future input parameters and thereby improve the performance of online algorithms. Since these parameters are, in general, real-valued functions, a natural approach is to use regression techniques to make these predictions. We introduce this approach in this paper, and explore it in the context of a general online search framework that captures classic problems like (generalized) ski rental, bin packing, minimum makespan scheduling, etc. We show nearly tight bounds on the sample complexity of this regression problem, and extend our results to the agnostic setting. From a technical standpoint, we show that the key is to incorporate online optimization benchmarks in the design of the loss function for the regression problem, thereby diverging from the use of off-the-shelf regression tools with standard bounds on statistical error.  ( 2 min )
    Property Unlearning: A Defense Strategy Against Property Inference Attacks. (arXiv:2205.08821v1 [cs.CR])
    During the training of machine learning models, they may store or "learn" more information about the training data than what is actually needed for the prediction or classification task. This is exploited by property inference attacks, which aim at extracting statistical properties from the training data of a given model without having access to the training data itself. These properties may include the quality of pictures to identify the camera model, the age distribution to reveal the target audience of a product, or the included host types to refine a malware attack in computer networks. This attack is especially accurate when the attacker has access to all model parameters, i.e., in a white-box scenario. By defending against such attacks, model owners are able to ensure that their training data, associated properties, and thus their intellectual property stay private, even if they deliberately share their models, e.g., to train collaboratively, or if models are leaked. In this paper, we introduce property unlearning, an effective defense mechanism against white-box property inference attacks, independent of the training data type, model task, or number of properties. Property unlearning mitigates property inference attacks by systematically changing the trained weights and biases of a target model such that an adversary cannot extract chosen properties. We empirically evaluate property unlearning on three different data sets, including tabular and image data, and two types of artificial neural networks. Our results show that property unlearning is both efficient and reliable in protecting machine learning models against property inference attacks, with a good privacy-utility trade-off. Furthermore, our approach indicates that this mechanism is also effective at unlearning multiple properties.  ( 2 min )
    TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision. (arXiv:2205.08731v1 [cs.LG])
    Nowadays, deep neural networks outperform humans in many tasks. However, if the input distribution drifts away from the one used in training, their performance drops significantly. Recently published research has shown that adapting the model parameters to the test sample can mitigate this performance degradation. In this paper, we therefore propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples. Using the provided prototypes of SwAV and our derived test-time loss, we align the representation of unseen test samples with the self-supervised learned prototypes. We show the success of our method on the common benchmark dataset CIFAR10-C.  ( 2 min )
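    The mechanics of single-sample adaptation can be sketched as below, using an entropy-sharpening loss over prototype assignments as an assumed stand-in for the paper's derived SwAV-based test-time loss.

```python
import torch
import torch.nn.functional as F

# One test-time gradient step that adapts the encoder so a test sample commits
# more confidently to the frozen prototypes. Loss form and step size are assumed.
def adapt_to_sample(encoder, prototypes, x, lr=1e-3):
    opt = torch.optim.SGD(encoder.parameters(), lr=lr)
    z = F.normalize(encoder(x), dim=-1)            # embed and L2-normalize
    logits = z @ F.normalize(prototypes, dim=-1).T # cosine similarity to prototypes
    p = logits.softmax(dim=-1)
    loss = -(p * p.clamp_min(1e-8).log()).sum(-1).mean()  # entropy of assignment
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

enc = torch.nn.Linear(16, 8)                       # toy encoder
protos = torch.randn(4, 8)                         # 4 frozen prototypes
print(adapt_to_sample(enc, protos, torch.randn(2, 16)))
```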
    It Isn't Sh!tposting, It's My CAT Posting. (arXiv:2205.08710v1 [cs.CV])
    In this paper, we describe a novel architecture that can generate hilarious captions for a given input image. The architecture is split into two halves, i.e., image captioning and hilarious text conversion. It starts with a pre-trained CNN model, VGG16 in this implementation, and applies an attention LSTM on top of it to generate a normal caption. These normal captions are then fed forward to our hilarious text conversion transformer, which converts the text into something hilarious while maintaining the context of the input image. The two halves can also be used separately: the seq2seq transformer alone can generate a hilarious caption from an input sentence. This paper aims to help everyday users be lazier and more hilarious at the same time by generating captions using CATNet.  ( 2 min )
    Exact Gaussian Processes for Massive Datasets via Non-Stationary Sparsity-Discovering Kernels. (arXiv:2205.09070v1 [stat.ML])
    A Gaussian Process (GP) is a prominent mathematical framework for stochastic function approximation in science and engineering applications. This success is largely attributed to the GP's analytical tractability, robustness, non-parametric structure, and natural inclusion of uncertainty quantification. Unfortunately, the use of exact GPs is prohibitively expensive for large datasets due to their unfavorable numerical complexity of $O(N^3)$ in computation and $O(N^2)$ in storage. All existing methods addressing this issue utilize some form of approximation -- usually considering subsets of the full dataset or finding representative pseudo-points that render the covariance matrix well-structured and sparse. These approximate methods can lead to inaccuracies in function approximations and often limit the user's flexibility in designing expressive kernels. Instead of inducing sparsity via data-point geometry and structure, we propose to take advantage of naturally-occurring sparsity by allowing the kernel to discover -- instead of induce -- sparse structure. The premise of this paper is that GPs, in their most native form, are often naturally sparse, but commonly-used kernels do not allow us to exploit this sparsity. The core concept of exact, and at the same time sparse GPs relies on kernel definitions that provide enough flexibility to learn and encode not only non-zero but also zero covariances. This principle of ultra-flexible, compactly-supported, and non-stationary kernels, combined with HPC and constrained optimization, lets us scale exact GPs well beyond 5 million data points.
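    The sparsity mechanism itself is easy to demonstrate: a compactly-supported kernel yields covariance entries that are exactly zero beyond a cutoff, so the matrix can be stored and factored sparsely, with no thresholding or approximation involved. The fixed-radius Wendland kernel below is only an illustration; the paper's kernels are additionally non-stationary and learned.

```python
import numpy as np
from scipy import sparse

# Wendland-C2 kernel: exactly zero for r >= radius, hence a truly sparse
# covariance matrix for an exact GP.
def wendland_c2(r, radius=0.05):
    s = np.clip(r / radius, 0.0, 1.0)
    return (1 - s) ** 4 * (4 * s + 1)

x = np.random.rand(2000, 1)                       # 2000 one-dimensional inputs
K = sparse.csr_matrix(wendland_c2(np.abs(x - x.T)))
print(f"fraction of nonzero covariance entries: {K.nnz / K.shape[0] ** 2:.1%}")
```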
    Bagged Polynomial Regression and Neural Networks. (arXiv:2205.08609v1 [stat.ML])
    Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of bagged polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset.  ( 2 min )
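    The estimator itself is simple to assemble from off-the-shelf pieces; the degree and ensemble size below are illustrative choices, not the paper's tuned values.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Bagged polynomial regression (BPR): average many degree-5 polynomial fits,
# each trained on a bootstrap resample of the data.
X = np.random.rand(500, 1)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(500)

base = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
bpr = BaggingRegressor(base, n_estimators=50).fit(X, y)
print(bpr.predict(X[:3]))
```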
    Need is All You Need: Homeostatic Neural Networks Adapt to Concept Shift. (arXiv:2205.08645v1 [cs.LG])
    In living organisms, homeostasis is the natural regulation of internal states aimed at maintaining conditions compatible with life. Typical artificial systems are not equipped with comparable regulatory features. Here, we introduce an artificial neural network that incorporates homeostatic features. Its own computing substrate is placed in a needful and vulnerable relation to the very objects over which it computes. For example, artificial neurons performing classification of MNIST digits or Fashion-MNIST articles of clothing may receive excitatory or inhibitory effects, which alter their own learning rate as a direct result of perceiving and classifying the digits. In this scenario, accurate recognition is desirable to the agent itself because it guides decisions to regulate its vulnerable internal states and functionality. Counterintuitively, the addition of vulnerability to a learner does not necessarily impair its performance. On the contrary, self-regulation in response to vulnerability confers benefits under certain conditions. We show that homeostatic design confers increased adaptability under concept shift, in which the relationships between labels and data change over time, and that the greatest advantages are obtained under the highest rates of shift. This necessitates the rapid un-learning of past associations and the re-learning of new ones. We also demonstrate the superior abilities of homeostatic learners in environments with dynamically changing rates of concept shift. Our homeostatic design exposes the artificial neural network's thinking machinery to the consequences of its own "thoughts", illustrating the advantage of putting one's own "skin in the game" to improve fluid intelligence.  ( 2 min )
    Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging. (arXiv:2205.08576v1 [cs.CV])
    The curation of large-scale medical datasets from multiple institutions, which is necessary for training deep learning models, is challenged by the difficulty of sharing patient data while preserving privacy. Federated learning (FL), a paradigm that enables privacy-protected collaborative learning among different institutions, is a promising solution to this challenge. However, FL generally suffers from performance deterioration due to heterogeneous data distributions across institutions and the lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Specifically, we introduce a novel distributed self-supervised pre-training paradigm into the existing FL pipeline (i.e., pre-training the models directly on the decentralized target task datasets). Built upon the recent success of Vision Transformers, we employ masked image encoding tasks for self-supervised pre-training, to facilitate more effective knowledge transfer to downstream federated models. Extensive empirical results on simulated and real-world medical imaging federated datasets show that self-supervised pre-training largely benefits the robustness of federated models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared with the supervised baseline with ImageNet pre-training. Moreover, we show that our self-supervised FL algorithm generalizes well to out-of-distribution data and learns federated models more effectively in limited label scenarios, surpassing the supervised baseline by 10.36% and the semi-supervised FL method by 8.3% in test accuracy.  ( 2 min )
    Variational Quantum Compressed Sensing for Joint User and Channel State Acquisition in Grant-Free Device Access Systems. (arXiv:2205.08603v1 [eess.SP])
    This paper introduces a new quantum computing framework integrated with a two-step compressed sensing technique, applied to a joint channel estimation and user identification problem. We propose a variational quantum circuit (VQC) design as a new denoising solution. For a practical grant-free communications system with correlated device activities, the variational quantum parameters of the Pauli rotation gates in the proposed VQC system are optimized to facilitate the non-linear estimation. Numerical results show that the VQC method can outperform modern compressed sensing techniques using an element-wise denoiser.  ( 2 min )
    Learning Quantum Entanglement Distillation with Noisy Classical Communications. (arXiv:2205.08561v1 [quant-ph])
    Quantum networking relies on the management and exploitation of entanglement. Practical sources of entangled qubits are imperfect, producing mixed quantum states with reduced fidelity with respect to ideal Bell pairs. Therefore, an important primitive for quantum networking is entanglement distillation, whose goal is to enhance the fidelity of entangled qubits through local operations and classical communication (LOCC). Existing distillation protocols assume the availability of ideal, noiseless, communication channels. In this paper, we study the case in which communication takes place over noisy binary symmetric channels. We propose to implement local processing through parameterized quantum circuits (PQCs) that are optimized to maximize the average fidelity, while accounting for communication errors. The introduced approach, Noise Aware-LOCCNet (NA-LOCCNet), is shown to have significant advantages over existing protocols designed for noiseless communications.  ( 2 min )
    Hierarchical Distribution-Aware Testing of Deep Learning. (arXiv:2205.08589v1 [cs.SE])
    With its growing use in safety/security-critical applications, Deep Learning (DL) has raised increasing concerns regarding its dependability. In particular, DL has a notorious problem of lacking robustness. Despite recent efforts made in detecting Adversarial Examples (AEs) via state-of-the-art attacking and testing methods, they are normally input distribution agnostic and/or disregard the perceptual quality of AEs. Consequently, the detected AEs are either irrelevant inputs in the application context or unnatural/unrealistic inputs that can be easily noticed by humans. This may lead to a limited effect on improving the DL model's dependability, as the testing budget is likely to be wasted on detecting AEs that are encountered very rarely in real-life operations. In this paper, we propose a new robustness testing approach for detecting AEs that considers both the input distribution and the perceptual quality of inputs. The two considerations are encoded by a novel hierarchical mechanism. First, at the feature level, the input data distribution is extracted and approximated by data compression techniques and probability density estimators. Such a quantified feature-level distribution, together with indicators that are highly correlated with local robustness, is used to select test seeds. Given a test seed, we then develop a two-step genetic algorithm for local test case generation at the pixel level, in which two fitness functions work alternately to control the quality of detected AEs. Finally, extensive experiments confirm that our holistic approach considering hierarchical distributions at feature and pixel levels is superior to state-of-the-art approaches that either disregard any input distribution or only consider a single (non-hierarchical) distribution, in terms of not only the quality of detected AEs but also improving the overall robustness of the DL model under testing.  ( 2 min )
    Frank Wolfe Meets Metric Entropy. (arXiv:2205.08634v1 [stat.ML])
    The Frank-Wolfe algorithm has seen a resurgence in popularity due to its ability to efficiently solve constrained optimization problems in machine learning and high-dimensional statistics. As such, there is much interest in establishing when the algorithm may possess a "linear" $O(\log(1/\epsilon))$ dimension-free iteration complexity comparable to projected gradient descent. In this paper, we provide a general technique for establishing domain specific and easy-to-estimate lower bounds for Frank-Wolfe and its variants using the metric entropy of the domain. Most notably, we show that a dimension-free linear upper bound must fail not only in the worst case, but in the \emph{average case}: for a Gaussian or spherical random polytope in $\mathbb{R}^d$ with $\mathrm{poly}(d)$ vertices, Frank-Wolfe requires up to $\tilde\Omega(d)$ iterations to achieve a $O(1/d)$ error bound, with high probability. We also establish this phenomenon for the nuclear norm ball. The link with metric entropy also has interesting positive implications for conditional gradient algorithms in statistics, such as gradient boosting and matching pursuit. In particular, we show that it is possible to extract fast-decaying upper bounds on the excess risk directly from an analysis of the underlying optimization procedure.  ( 2 min )
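    For reference, a plain Frank-Wolfe iteration on the $\ell_1$ ball is sketched below: each step calls a linear minimization oracle over the domain and moves toward the returned vertex. The metric-entropy bounds above concern how many such steps polytope domains can force.

```python
import numpy as np

# Frank-Wolfe on the l1 ball (a simple polytope) with the classic step size
# gamma_t = 2 / (t + 2). The LMO on the l1 ball returns the best signed vertex.
def frank_wolfe_l1(grad, x0, radius=1.0, iters=200):
    x = x0.astype(float).copy()
    for t in range(iters):
        g = grad(x)
        i = np.argmax(np.abs(g))             # linear minimization oracle
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])
        x += 2.0 / (t + 2) * (s - x)         # convex step toward the vertex
    return x

A, b = np.random.randn(20, 10), np.random.randn(20)
grad = lambda x: 2 * A.T @ (A @ x - b)       # gradient of ||Ax - b||^2
print(frank_wolfe_l1(grad, np.zeros(10)))
```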
    Quantum Transfer Learning for Wi-Fi Sensing. (arXiv:2205.08590v1 [cs.LG])
    Beyond data communications, commercial-off-the-shelf Wi-Fi devices can be used to monitor human activities, track device locomotion, and sense the ambient environment. In particular, spatial beam attributes that are inherently available in the 60-GHz IEEE 802.11ad/ay standards have been shown to be effective in terms of overhead and channel measurement granularity for these indoor sensing tasks. In this paper, we investigate transfer learning to mitigate domain shift in human monitoring tasks when Wi-Fi settings and environments change over time. As a proof-of-concept study, we consider quantum neural networks (QNN) as well as classical deep neural networks (DNN) for the future quantum-ready society. The effectiveness of both DNN and QNN is validated by an in-house experiment for human pose recognition, achieving greater than 90% accuracy with a limited data size.  ( 2 min )
    Classification as Direction Recovery: Improved Guarantees via Scale Invariance. (arXiv:2205.08633v1 [stat.ML])
    Modern algorithms for binary classification rely on an intermediate regression problem for computational tractability. In this paper, we establish a geometric distinction between classification and regression that allows risk in these two settings to be more precisely related. In particular, we note that classification risk depends only on the direction of the regressor, and we take advantage of this scale invariance to improve existing guarantees for how classification risk is bounded by the risk in the intermediate regression problem. Building on these guarantees, our analysis makes it possible to compare algorithms more accurately against each other and suggests viewing classification as unique from regression rather than a byproduct of it. While regression aims to converge toward the conditional expectation function in location, we propose that classification should instead aim to recover its direction.  ( 2 min )
    Strategizing against Learners in Bayesian Games. (arXiv:2205.08562v1 [cs.LG])
    We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.  ( 2 min )
    All-Photonic Artificial Neural Network Processor Via Non-linear Optics. (arXiv:2205.08608v1 [physics.optics])
    Optics and photonics have recently captured interest as a platform to accelerate linear matrix processing, which has been deemed a bottleneck in traditional digital electronic architectures. In this paper, we propose an all-photonic artificial neural network processor wherein information is encoded in the amplitudes of frequency modes that act as neurons. The weights among connected layers are encoded in the amplitudes of controlled frequency modes that act as pumps. Interaction among these modes for information processing is enabled by non-linear optical processes. Both the matrix multiplication and element-wise activation functions are performed through coherent processes, enabling the direct representation of negative and complex numbers without the use of detectors or digital electronics. Via numerical simulations, we show that our design achieves a performance commensurate with present-day state-of-the-art computational networks on image-classification benchmarks. Our architecture is unique in providing a completely unitary, reversible mode of computation. Additionally, the computational speed increases with the power of the pumps to arbitrarily high rates, as long as the circuitry can sustain the higher optical power.  ( 2 min )
    Multibit Tries Packet Classification with Deep Reinforcement Learning. (arXiv:2205.08606v1 [cs.NI])
    High-performance packet classification is a key component to support scalable network applications like firewalls, intrusion detection, and differentiated services. With ever-increasing line rates in core networks, it becomes a great challenge to design a scalable and high-performance packet classification solution using hand-tuned heuristic approaches. In this paper, we present a scalable learning-based packet classification engine and its performance evaluation. By exploiting the sparsity of the ruleset, our algorithm uses a few effective bits (EBs) to extract a large number of candidate rules with just a few memory accesses. These effective bits are learned with deep reinforcement learning, and they are used to create a bitmap that filters out the majority of rules which do not need to be fully matched, improving online system performance. Moreover, our EB learning-based selection method is independent of the ruleset and can be applied to varying rulesets. Our multibit-trie classification engine improves lookup time by 55% in both the worst and average cases and reduces the memory footprint, compared to traditional decision trees without EBs.  ( 2 min )
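    The filtering step can be pictured with the toy sketch below: a handful of assumed bit positions index a table of candidate rules, and only the surviving candidates are fully matched. In the paper the bit positions are chosen by a deep RL agent; here they are hard-coded for illustration.

```python
from collections import defaultdict

# Toy effective-bits (EB) filter over 4-bit "headers".
RULES = [("rule_a", 0b1010, 0b1111), ("rule_b", 0b1000, 0b1110)]  # (name, value, mask)
EB_POSITIONS = [3, 1]                             # assumed effective-bit indices

def eb_key(header):
    return tuple((header >> b) & 1 for b in EB_POSITIONS)

candidates = defaultdict(list)                    # EB pattern -> candidate rules
for name, value, mask in RULES:
    for key in {eb_key(h) for h in range(16) if (h & mask) == value}:
        candidates[key].append((name, value, mask))

def classify(header):
    for name, value, mask in candidates.get(eb_key(header), []):
        if header & mask == value:                # full match only on survivors
            return name
    return "default"

print(classify(0b1010), classify(0b0101))         # -> rule_a default
```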
    A graph representation of molecular ensembles for polymer property prediction. (arXiv:2205.08619v1 [cs.LG])
    Synthetic polymers are versatile and widely used materials. Similar to small organic molecules, a large chemical space of such materials is hypothetically accessible. Computational property prediction and virtual screening can accelerate polymer design by prioritizing candidates expected to have favorable properties. However, in contrast to organic molecules, polymers are often not well-defined single structures but an ensemble of similar molecules, which poses unique challenges to traditional chemical representations and machine learning approaches. Here, we introduce a graph representation of molecular ensembles and an associated graph neural network architecture that is tailored to polymer property prediction. We demonstrate that this approach captures critical features of polymeric materials, like chain architecture, monomer stoichiometry, and degree of polymerization, and achieves superior accuracy to off-the-shelf cheminformatics methodologies. While doing so, we built a dataset of simulated electron affinity and ionization potential values for >40k polymers with varying monomer composition, stoichiometry, and chain architecture, which may be used in the development of other tailored machine learning approaches. The dataset and machine learning models presented in this work pave the path toward new classes of algorithms for polymer informatics and, more broadly, introduce a framework for the modeling of molecular ensembles.  ( 2 min )
    Generic and Trend-aware Curriculum Learning for Relation Extraction in Graph Neural Networks. (arXiv:2205.08625v1 [cs.CL])
    We present a generic and trend-aware curriculum learning approach for graph neural networks. It extends existing approaches by incorporating sample-level loss trends to better discriminate easier from harder samples and schedule them for training. The model effectively integrates textual and structural information for relation extraction in text graphs. Experimental results show that the model provides robust estimations of sample difficulty and shows sizable improvement over the state-of-the-art approaches across several datasets.  ( 2 min )
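    A hedged sketch of what trend-aware scheduling might look like: fit a slope to each sample's recent loss history, treat samples whose loss is falling as easier, and admit harder samples gradually. The slope statistic and pacing fraction are our own assumptions, not the paper's schedule.

```python
import numpy as np

# Trend-aware curriculum sketch: rank samples by the slope of a least-squares
# line through their recent losses and train on the most-improving fraction.
def loss_trend(history):
    t = np.arange(len(history))
    return np.polyfit(t, history, 1)[0]           # slope of the loss trajectory

def select_batch(loss_histories, frac_easy=0.7):
    slopes = np.array([loss_trend(h) for h in loss_histories])
    order = np.argsort(slopes)                    # most-improving (easiest) first
    k = int(frac_easy * len(order))
    return order[:k]

histories = [np.random.rand(5) - 0.1 * np.arange(5) for _ in range(8)]
print(select_batch(histories))
```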
    Learning to Learn Quantum Turbo Detection. (arXiv:2205.08611v1 [eess.SP])
    This paper investigates a turbo receiver employing a variational quantum circuit (VQC). The VQC is configured with an ansatz of the quantum approximate optimization algorithm (QAOA). We propose a 'learning to learn' (L2L) framework to optimize the turbo VQC decoder such that high fidelity soft-decision output is generated. Besides demonstrating the proposed algorithm's computational complexity, we show that the L2L VQC turbo decoder can achieve an excellent performance close to the optimal maximum-likelihood performance in a multiple-input multiple-output system.  ( 2 min )
    Deep Neural Network Classifier for Multi-dimensional Functional Data. (arXiv:2205.08592v1 [stat.ML])
    We propose a new approach, called functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data and is then used to predict the class label of a future data function. Unlike popular functional discriminant analysis approaches, which rely on a Gaussian assumption, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data. Moreover, when the log density ratio possesses a locally connected functional modular structure, we show that FDNN achieves minimax optimality. The superiority of our approach is demonstrated through both simulated and real-world datasets.  ( 2 min )
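    The recipe reduces to a short pipeline, sketched here on synthetic curves with assumed choices for the number of components and the network size:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# FDNN-style pipeline sketch: project discretized curves onto leading principal
# components, then train a neural network on the scores.
n, grid = 300, 100
t = np.linspace(0, 1, grid)
y = np.random.randint(0, 2, n)                    # two classes of curves
X = np.array([np.sin(2 * np.pi * (1 + c) * t) + 0.3 * np.random.randn(grid) for c in y])

scores = PCA(n_components=10).fit_transform(X)    # principal component scores
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(scores, y)
print(clf.score(scores, y))
```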
    CV4Code: Sourcecode Understanding via Visual Code Representations. (arXiv:2205.08585v1 [cs.SE])
    We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.  ( 2 min )
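    The codepoint-image encoding can be pictured with this small sketch, where the canvas size and the ASCII clamp are illustrative assumptions:

```python
import numpy as np

# Lay a snippet out as a fixed-size 2-D array of character codepoints
# (rows = lines, columns = character positions), zero-padded.
def code_to_image(source, height=16, width=48):
    img = np.zeros((height, width), dtype=np.uint8)
    for i, line in enumerate(source.splitlines()[:height]):
        for j, ch in enumerate(line[:width]):
            img[i, j] = min(ord(ch), 127)         # clamp to the ASCII range
    return img

snippet = "def add(a, b):\n    return a + b\n"
print(code_to_image(snippet).shape, code_to_image(snippet)[0, :3])
```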
    OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval. (arXiv:2205.08605v1 [cs.CL])
    Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model is able to train on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by 8.0 points in accuracy while using less than 0.6% of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of the ones finetuned on all language pairs under the same data budget with less than 2.0 points decrease in accuracy. Furthermore, with the same setup, scaling up the number of rich-resource language pairs monotonically improves the performance, reaching a minimum of 0.4 points discrepancy in accuracy, making it less mandatory to collect any low-resource parallel data. Finally, we conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size, up to a certain size threshold, rather than on what language pairs are used for training or evaluation.  ( 2 min )
    The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation. (arXiv:2205.08579v1 [cs.SD])
    Symbolic Music Generation relies on the contextual representation capabilities of the generative model, where the most prevalent approach is the Transformer-based model. Moreover, learning long-term context is also related to the dynamic segmentation of musical structures, i.e., intro, verse, and chorus, which is currently overlooked by the research community. In this paper, we propose a multi-scale Transformer, which uses a coarse-decoder and fine-decoders to model the contexts at the global and section levels, respectively. Concretely, we designed a Fragment Scope Localization layer to segment the music into sections, which are later used to pre-train the fine-decoders. After that, we designed a Music Style Normalization layer to transfer the style information from the original sections to the generated sections to achieve consistency in music style. The generated sections are combined in the aggregation layer and fine-tuned by the coarse decoder. Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models. More excitingly, visual evaluation shows that our model is superior in melody reuse, resulting in more realistic music.  ( 2 min )
    Universal characteristics of deep neural network loss surfaces from random matrix theory. (arXiv:2205.08601v1 [math-ph])
    This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.  ( 2 min )
  • Open

    A Unified Linear Speedup Analysis of Stochastic FedAvg and Nesterov Accelerated FedAvg. (arXiv:2007.05690v3 [cs.LG] UPDATED)
    Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regards to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)--arguably the most popular and effective FL algorithm class in use today--and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, a systematic study of how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting is lacking--a crucial issue whose answer would shed light on the performance of FedAvg in large FL systems in practice. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg under strongly convex smooth, convex smooth problems, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results from distributed optimization that assumes full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in convex settings. Empirical studies of the algorithms in various settings have supported our theoretical results.
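    For orientation, the aggregation structure the analysis studies can be reduced to a few lines; the local least-squares updates and client setup below are toy stand-ins for the far more general settings the paper covers.

```python
import numpy as np

# Bare-bones FedAvg: clients take local gradient steps, then the server
# averages the local models weighted by local data size.
def local_sgd(w, X, y, lr=0.1, steps=5):
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)   # gradient of local MSE
    return w

def fedavg_round(w_global, clients):
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    locals_ = [local_sgd(w_global.copy(), X, y) for X, y in clients]
    return sum(s * w for s, w in zip(sizes / sizes.sum(), locals_))

d = 3
clients = [(np.random.randn(20, d), np.random.randn(20)) for _ in range(4)]
w = np.zeros(d)
for _ in range(10):                                    # communication rounds
    w = fedavg_round(w, clients)
print(w)
```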
    Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. (arXiv:2205.09029v1 [stat.ML])
    Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
    Dependent Latent Class Models. (arXiv:2205.08677v1 [stat.ML])
    Latent Class Models (LCMs) are used to cluster multivariate categorical data (e.g. group participants based on survey responses). Traditional LCMs assume a property called conditional independence. This assumption can be restrictive, leading to model misspecification and overparameterization. To combat this problem, we developed a novel Bayesian model called a Dependent Latent Class Model (DLCM), which permits conditional dependence. We verify identifiability of DLCMs. We also demonstrate the effectiveness of DLCMs in both simulations and real-world applications. Compared to traditional LCMs, DLCMs are effective in applications with time series, overlapping items, and structural zeroes.
    Bayesian Discrete Conditional Transformation Models. (arXiv:2205.08594v1 [stat.ME])
    We propose a novel Bayesian model framework for discrete ordinal and count data based on conditional transformations of the responses. The conditional transformation function is estimated from the data in conjunction with an a priori chosen reference distribution. For count responses, the resulting transformation model is novel in the sense that it is a Bayesian fully parametric yet distribution-free approach that can additionally account for excess zeros with additive transformation function specifications. For ordinal categorical responses, our cumulative link transformation model allows the inclusion of linear and nonlinear covariate effects that can additionally be made category-specific, resulting in (non-)proportional odds or hazards models and more, depending on the choice of the reference distribution. Inference is conducted by a generic modular Markov chain Monte Carlo algorithm where multivariate Gaussian priors enforce specific properties such as smoothness on the functional effects. To illustrate the versatility of Bayesian discrete conditional transformation models, applications to counts of patent citations in the presence of excess zeros and to forest health categories in a discrete partial proportional odds model are presented.
    Conformalized Online Learning: Online Calibration Without a Holdout Set. (arXiv:2205.09095v1 [cs.LG])
    We develop a framework for constructing uncertainty sets with a valid coverage guarantee in an online setting, in which the underlying data distribution can drastically -- and even adversarially -- shift over time. The technique we propose is highly flexible as it can be integrated with any online learning algorithm, requiring minimal implementation effort and computational cost. A key advantage of our method over existing alternatives -- which also build on conformal inference -- is that we do not need to split the data into training and holdout calibration sets. This allows us to fit the predictive model in a fully online manner, utilizing the most recent observation for constructing calibrated uncertainty sets. Consequently, and in contrast with existing techniques, (i) the sets we build can quickly adapt to new changes in the distribution; and (ii) our procedure does not require refitting the model at each time step. Using synthetic and real-world benchmark data sets, we demonstrate the validity of our theory and the improved performance of our proposal over existing techniques. To demonstrate the greater flexibility of the proposed method, we show how to construct valid intervals for a multiple-output regression problem that previous sequential calibration methods cannot handle due to impractical computational and memory requirements.
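    A much-simplified, assumption-laden sketch of the fully-online flavor: recycle recent residuals as the calibration pool instead of a held-out split, and recompute the interval width before each prediction. The paper's actual procedure differs and is what carries the formal guarantees.

```python
import numpy as np

# Online intervals without a holdout set: the (1 - alpha) empirical quantile of
# a rolling window of past absolute residuals sets the next half-width.
def online_intervals(preds, ys, alpha=0.1, window=100):
    residuals, intervals = [], []
    for p, y in zip(preds, ys):
        q = np.quantile(residuals, 1 - alpha) if residuals else np.inf
        intervals.append((p - q, p + q))
        residuals = (residuals + [abs(y - p)])[-window:]  # update after outcome
    return intervals

ys = np.random.randn(500).cumsum()
preds = np.concatenate([[0.0], ys[:-1]])          # naive one-step-ahead forecast
ivs = online_intervals(preds, ys)
cover = np.mean([(lo <= y <= hi) for (lo, hi), y in zip(ivs, ys)])
print(f"empirical coverage: {cover:.2f}")
```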
    Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks. (arXiv:2004.12551v2 [cs.LG] UPDATED)
    Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.
    A simple yet effective baseline for non-attributed graph classification. (arXiv:1811.03508v3 [cs.LG] UPDATED)
    Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective. As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves similar performance as the state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with the graph neural networks. Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited amount of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. (arXiv:2112.06868v2 [cs.LG] UPDATED)
    Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders the conjecture is true: VAE training does recover a generator with support equal to the ground truth manifold, and it does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.
    Doubly Robust Collaborative Targeted Learning for Debiased Recommendations. (arXiv:2203.10258v2 [cs.IR] UPDATED)
    In recommender systems, the collected data always contains various biases and leads to the challenge of accurate predictions. To address selection bias and confounding bias, the doubly robust (DR) method and its variants show superior performance due to the double robustness property and smaller bias under inaccurate propensity and error imputation models. However, we theoretically show that the variance of the error imputation-based (EIB) method is much smaller than that of DR, although EIB may suffer from a much larger bias. In this paper, we propose a doubly robust targeted learning method that effectively combines the small-bias property of DR and the small-variance property of EIB, by leveraging the targeted maximum likelihood estimation technique. Theoretical analysis shows that the proposed targeted learning is effective in reducing the variance of DR while maintaining double robustness. To further reduce the bias and variance during the training process, we propose a novel collaborative targeted learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
    Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling. (arXiv:2205.08766v1 [cs.LG])
    Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v1 [cs.LG])
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
    A Central Limit Theorem, Loss Aversion and Multi-Armed Bandits. (arXiv:2106.05472v2 [math.PR] UPDATED)
    This paper studies a multi-armed bandit problem where the decision-maker is loss averse, in particular she is risk averse in the domain of gains and risk loving in the domain of losses. The focus is on large horizons. Consequences of loss aversion for asymptotic (large horizon) properties are derived in a number of analytical results. The analysis is based on a new central limit theorem for a set of measures under which conditional variances can vary in a largely unstructured history-dependent way subject only to the restriction that they lie in a fixed interval.
    GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems. (arXiv:2201.05024v3 [eess.SP] UPDATED)
    In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for a massive connectivity scenario with non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency division multiplexing (OFDM) radio frame requires the execution of a large number of inner product operations, which are the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond latency detection to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projection onto convex sets (POCS) and adaptive projected subgradient method (APSM) algorithms. Experimental results and comparisons with the state of the art confirm the effectiveness of our techniques.  ( 2 min )
    New Lower Bounds for Private Estimation and a Generalized Fingerprinting Lemma. (arXiv:2205.08532v2 [cs.DS] UPDATED)
    We prove new lower bounds for statistical estimation tasks under the constraint of $(\varepsilon, \delta)$-differential privacy. First, we provide tight lower bounds for private covariance estimation of Gaussian distributions. We show that estimating the covariance matrix in Frobenius norm requires $\Omega(d^2)$ samples, and in spectral norm requires $\Omega(d^{3/2})$ samples, both matching upper bounds up to logarithmic factors. We prove these bounds via our main technical contribution, a broad generalization of the fingerprinting method to exponential families. Additionally, using the private Assouad method of Acharya, Sun, and Zhang, we show a tight $\Omega(d/(\alpha^2 \varepsilon))$ lower bound for estimating the mean of a distribution with bounded covariance to $\alpha$-error in $\ell_2$-distance. Prior known lower bounds for all these problems were either polynomially weaker or held under the stricter condition of $(\varepsilon,0)$-differential privacy.
    Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization. (arXiv:2205.08835v1 [cs.LG])
    There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.
    Meta-Learning Sparse Compression Networks. (arXiv:2205.08957v1 [stat.ML])
    Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks, this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing for sparsification to be employed in the inner loop of commonly used Meta-Learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v2 [q-bio.NC] UPDATED)
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.
    On the Efficiency of Entropic Regularized Algorithms for Optimal Transport. (arXiv:1906.01437v9 [cs.DS] UPDATED)
    We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct the experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.
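    As a concrete reference point for these variants, here is a minimal NumPy sketch of the plain Sinkhorn iteration they build on; the cost matrix, marginals, and regularization strength below are illustrative assumptions, not taken from the paper:

        import numpy as np

        def sinkhorn(C, r, c, eta=0.1, n_iter=500):
            # Entropic OT: alternately rescale rows and columns of the Gibbs kernel
            K = np.exp(-C / eta)
            u = np.ones_like(r)
            for _ in range(n_iter):
                v = c / (K.T @ u)   # match column marginals
                u = r / (K @ v)     # match row marginals
            return u[:, None] * K * v[None, :]  # transport plan diag(u) K diag(v)

        rng = np.random.default_rng(0)
        n = 50
        r = rng.random(n); r /= r.sum()
        c = rng.random(n); c /= c.sum()
        C = (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2 / n**2
        P = sinkhorn(C, r, c)
        print(np.allclose(P.sum(axis=1), r))  # rows match r after the final u-update

    Greenkhorn differs from this only in updating the single most-violated row or column per step rather than all of them at once, which is where the row/column-update comparison in the abstract comes from.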
    Classification as Direction Recovery: Improved Guarantees via Scale Invariance. (arXiv:2205.08633v1 [stat.ML])
    Modern algorithms for binary classification rely on an intermediate regression problem for computational tractability. In this paper, we establish a geometric distinction between classification and regression that allows risk in these two settings to be more precisely related. In particular, we note that classification risk depends only on the direction of the regressor, and we take advantage of this scale invariance to improve existing guarantees for how classification risk is bounded by the risk in the intermediate regression problem. Building on these guarantees, our analysis makes it possible to compare algorithms more accurately against each other and suggests viewing classification as unique from regression rather than a byproduct of it. While regression aims to converge toward the conditional expectation function in location, we propose that classification should instead aim to recover its direction.
    The Kernelized Taylor Diagram. (arXiv:2205.08864v1 [stat.ML])
    This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. The kernelized Taylor diagram builds on the widely used Taylor diagram, which is used to visualize similarities between populations. However, the Taylor diagram has several limitations, such as not capturing non-linear relationships and sensitivity to outliers. To address such limitations, we propose the kernelized Taylor diagram. Our proposed kernelized Taylor diagram is capable of visualizing similarities between populations with minimal assumptions of the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.
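    For readers unfamiliar with the quantities being related, a minimal NumPy sketch of a biased MMD estimate between two samples; the RBF kernel and its bandwidth are illustrative assumptions:

        import numpy as np

        def rbf(X, Y, gamma=1.0):
            # Gram matrix of the RBF kernel between two samples
            sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * sq)

        def mmd(X, Y, gamma=1.0):
            # Biased estimate: distance between the two kernel mean embeddings in the RKHS
            val = rbf(X, X, gamma).mean() + rbf(Y, Y, gamma).mean() - 2 * rbf(X, Y, gamma).mean()
            return np.sqrt(max(val, 0.0))

        rng = np.random.default_rng(0)
        X = rng.normal(0.0, 1.0, (200, 2))
        Y = rng.normal(0.5, 1.0, (200, 2))
        print(mmd(X, Y))  # grows as the two populations become more dissimilar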
    Exact Gaussian Processes for Massive Datasets via Non-Stationary Sparsity-Discovering Kernels. (arXiv:2205.09070v1 [stat.ML])
    A Gaussian Process (GP) is a prominent mathematical framework for stochastic function approximation in science and engineering applications. This success is largely attributed to the GP's analytical tractability, robustness, non-parametric structure, and natural inclusion of uncertainty quantification. Unfortunately, the use of exact GPs is prohibitively expensive for large datasets due to their unfavorable numerical complexity of $O(N^3)$ in computation and $O(N^2)$ in storage. All existing methods addressing this issue utilize some form of approximation -- usually considering subsets of the full dataset or finding representative pseudo-points that render the covariance matrix well-structured and sparse. These approximate methods can lead to inaccuracies in function approximations and often limit the user's flexibility in designing expressive kernels. Instead of inducing sparsity via data-point geometry and structure, we propose to take advantage of naturally-occurring sparsity by allowing the kernel to discover -- instead of induce -- sparse structure. The premise of this paper is that GPs, in their most native form, are often naturally sparse, but commonly-used kernels do not allow us to exploit this sparsity. The core concept of exact, and at the same time sparse GPs relies on kernel definitions that provide enough flexibility to learn and encode not only non-zero but also zero covariances. This principle of ultra-flexible, compactly-supported, and non-stationary kernels, combined with HPC and constrained optimization, lets us scale exact GPs well beyond 5 million data points.
    Distribution-free Prediction Sets Adaptive to Unknown Covariate Shift. (arXiv:2203.06126v3 [stat.ME] UPDATED)
    Predicting sets of outcomes -- instead of unique outcomes -- is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift -- a prevalent issue in practice -- poses a serious challenge and has yet to be fully solved. In this paper, we propose a novel flexible distribution-free method, PredSet-1Step, to construct prediction sets that can efficiently adapt to unknown covariate shift. We formally show that our method is \textit{asymptotically probably approximately correct}, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators. This is a technical tool of independent interest.
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v2 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). Then, we approximate the resultant GP by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). ICK is flexible and can be used to include prior information in neural networks in many applications. We demonstrate the strength of our framework by showing its superior performance and flexibility on both synthetic and real-world data sets. The code is available at: https://anonymous.4open.science/r/ICK_NNGP-17C5/.
    Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement. (arXiv:1810.09103v3 [cs.LG] UPDATED)
    Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, that uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy-regularization.
    A label efficient two-sample test. (arXiv:2111.08861v3 [cs.LG] UPDATED)
    Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v2 [stat.ML] UPDATED)
    Traditional ways for handling missing values are not designed for the clustering purpose and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available on \url{https://github.com/AudeSportisse/Clustering-MNAR}. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that the statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations for the proposed sub-models on synthetic data and we illustrate the relevance of our method on a medical register, the TraumaBase$^{\textregistered}$ dataset.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v3 [stat.ME] UPDATED)
    Recent advances in probabilistic deep learning enable amortized Bayesian inference in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations represent reality somewhat inaccurately? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of SNPE-C (APT) and the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the learned latent data summary space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference undermining the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase when the distance between the latent summary distributions of the true data-generating process and the training simulations grows. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized simulation-based Bayesian inference.
    FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting. (arXiv:2205.08897v1 [cs.LG])
    Recent studies have shown the promising performance of deep learning models (e.g., RNN and Transformer) for long-term time series forecasting. These studies mostly focus on designing deep models to effectively combine historical information for long-term forecasting. However, the question of how to effectively represent historical information for long-term forecasting has not received enough attention, limiting our capacity to exploit powerful deep learning models. The main challenge in time series representation is how to handle the dilemma between accurately preserving historical information and reducing the impact of noisy signals in the past. To this end, we design a \textbf{F}requency \textbf{i}mproved \textbf{L}egendre \textbf{M}emory model, or {\bf FiLM} for short: it introduces Legendre Polynomial projections to preserve historical information accurately and Fourier projections plus low-rank approximation to remove noisy signals. Our empirical studies show that the proposed FiLM improves the accuracy of state-of-the-art models by a significant margin (\textbf{19.2\%}, \textbf{22.6\%}) in multivariate and univariate long-term forecasting, respectively. In addition, dimensionality reduction introduced by low-rank approximation leads to a dramatic improvement in computational efficiency. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the performance of most deep learning modules for long-term forecasting. Code will be released soon.
    SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals. (arXiv:2004.09557v3 [cs.LG] UPDATED)
    Clinical settings are often characterized by abundant unlabelled data and limited labelled data. This is typically driven by the high burden placed on oracles (e.g., physicians) to provide annotations. One way to mitigate this burden is via active learning (AL) which involves the (a) acquisition and (b) annotation of informative unlabelled instances. Whereas previous work addresses either one of these elements independently, we propose an AL framework that addresses both. For acquisition, we propose Bayesian Active Learning by Consistency (BALC), a sub-framework which perturbs both instances and network parameters and quantifies changes in the network output probability distribution. For annotation, we propose SoQal, a sub-framework that dynamically determines whether, for each acquired unlabelled instance, to request a label from an oracle or to pseudo-label it instead. We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
    Sharp asymptotics on the compression of two-layer neural networks. (arXiv:2205.08199v2 [cs.IT] UPDATED)
    In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. For a ReLU activation function, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
    Bagged Polynomial Regression and Neural Networks. (arXiv:2205.08609v1 [stat.ML])
    Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of bagged polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v2 [cs.IT] UPDATED)
    Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to an uncorrelated learning. This finding leads to the design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.

  • Open

    [D] Adversarial testing for Fairness and Biases
    Is it worthwhile exposing ML models to the public for adversarial testing of whether they meet fairness conditions, given that there are so many edge cases to test for when deploying an ML model? (Similar to the concept of bug bounties in cybersecurity.) submitted by /u/blitzkreig3
    [P] Keras Launches a Computer Vision Extension Package
    Keras has launched a computer vision extension package. Links: - https://keras.io/keras_cv/ - https://github.com/keras-team/keras-cv/ submitted by /u/puppet_pals
    [N] Introducing Accelerated PyTorch Training on Mac
    https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/ submitted by /u/eigenlaplace
    [Discussion] Are there any better Topic Modelling algorithms/models other than LDA?
    As the title suggests, could someone point me to some new Topic Modelling algorithms that have come up recently and are in some way better? My use-case is modelling tweets, if that helps! TIA EDIT: The approach is location-based, meaning I will be increasing the document length by merging nearby tweets. submitted by /u/mrnerdy59
    [R] How is F1-score a better metric for unbalanced data sets ?
    Alright, so bear with me: I have an unbalanced data set, say a +1 class accounting for 80% of my data and a -1 class representing the remaining 20%. I fit some kind of model and report both accuracy and F1-score. The F1-score is supposed to counteract the skewed data distribution, but since F1 is computed on the +1 class, which is the majority class, it's still going to be biased, right? To me, the answer would be taking the average F1-score over both classes, which sklearn calls the "macro" F1 score (I'm aware there are two very different takes on how "macro" metrics are computed), but people on the internet seem to get very angry when we use macro F1 for binary classification, although it sounds like a very reasonable choice to me. What do you guys think? Note: I'm doing a thesis in quantitative finance and a lot of papers use accuracy to artificially boost their models' predictive power; I'm trying to show why this shouldn't be done and why macro F1 would yield much more realistic results. submitted by /u/delta9_
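    For what it's worth, a minimal sklearn sketch of the gap the post describes, using an always-majority classifier on an 80/20 split (both choices are illustrative):

        from sklearn.metrics import f1_score

        # 80/20 imbalance; degenerate classifier that always predicts the majority class
        y_true = [1] * 80 + [-1] * 20
        y_pred = [1] * 100

        print(f1_score(y_true, y_pred, pos_label=1))                       # ~0.89, looks fine
        print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44, exposes the failure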
    [D] Those at FAANG, do you use AutoML?
    Do you use AutoML? I see that Microsoft, Google, Amazon, etc. all have their own AutoML SDKs/services. submitted by /u/stevofolife
    [D] KDD Notifications Thread
    I don't think I usually see the data mining conference discussed much here. So anyway, hoping to hear that KDD might love me. submitted by /u/EdwardRaff
    [N] Flower Summit 2022
    On May 31, 2022, the Flower Community will come together for the second Flower Summit 2022. Join experts in the field of federated learning and find out how Flower accelerates the development of systems in both research and production scenarios. All speakers and the corresponding time schedule are final now. You can expect speakers from Intel, Google/MLCommons, Brave, University of Cambridge, AI Sweden, and many more. Block your calendar and register now: https://flower.dev/conf/flower-summit-2022/ submitted by /u/burnai
    [D] MNIST equivalent dataset for RNN/LSTM/Transformer?
    Hi, I'm developing course material and looking for a nice introductory task/dataset for RNNs/LSTMs/Transformers. I could use recurrent networks for MNIST too, but I'm looking for a more "classic" example such as sequence prediction or classification. Is there any? Preferably a relatively simple, clean and well-known dataset like MNIST. Thank you very much. submitted by /u/euske
    [R] Handcrafted localized phase features for human action recognition
    This paper https://paperswithcode.com/paper/handcrafted-localized-phase-features-for claims to achieve 98% top-1 accuracy on Kinetics-400 and 96.35 on Kinetics-700. From their description, they compute phase correlation on large patches between consecutive frames and then use that in a kNN classifier. I didn't find any extra info in the paper about the method and frankly I find it hard to believe this beats all of the recent state-of-the-art methods. What do you think? Maybe a (possibly unintended) foul in the evaluation method? submitted by /u/AmirRosenfeld
    [R] Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks
    submitted by /u/martenlienen
    [D] Paper for Object Detection
    Recently, my team and I created a dataset with annotated labels for object detection. We plan to write a paper about it. However, the object detection part is quite trivial, by which I mean that we try 3-4 approaches (SSD, Faster R-CNN, etc.) with some optimization (e.g. the same parameters on all approaches, then fine-tune the best performer?). Is this good material for a paper? I think if we present the findings as a dataset plus benchmark experiments it would be sufficient. Should we try more complex pipelines? submitted by /u/giakou4
    [D] Any recommended do's/don't for rebuttal phase ?
    The rebuttal is the stage where preliminary reviews have been released to the authors. Now is the time for authors to address the concerns of reviewers. Do you have some essential do's/don'ts guidelines which you follow? It may be about writing responses which disagree with a reviewer, or perhaps presenting results on additional experiments which were requested. submitted by /u/PaganPasta
    [N] Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit: Ian Goodfellow, a former director of machine learning at Apple, is joining DeepMind.
    According to an article published in Bloomberg, an Apple Inc. executive who left over the company’s stringent return-to-office policy is joining Alphabet Inc.’s DeepMind unit, according to people with knowledge of the matter. Ian Goodfellow, who oversaw machine learning and artificial intelligence at Apple, left the iPhone maker in recent weeks, citing the lack of flexibility in its work policies. The company had been planning to require corporate employees to work from the office on Mondays, Tuesdays and Thursdays, starting this month. That deadline was put on hold Tuesday, though. https://www.bloomberg.com/news/articles/2022-05-17/ian-goodfellow-former-apple-director-of-machine-learning-to-join-deepmind submitted by /u/hardmaru
    [D] Are there any Tracking approaches using DeepSORT + Detector that isnt Yolo
    Hey, I've been wondering: all the repositories I could find on the internet are YOLO detectors combined with DeepSORT. I couldn't find others; what I'm most interested in would be EfficientDet or CenterNet combined with DeepSORT. Are there any repositories you guys know of? Over the last few days I discovered my fascination with the TensorFlow 2 Object Detection API, and I was wondering if I could use my weights from there combined with DeepSORT. Best regards submitted by /u/HolidayLobster1355
  • Open

    Google AI explains how Assistant answers your follow-up questions
    submitted by /u/Zirius_Sadfaces
    The dream of destruction (made with starryai)
    submitted by /u/akhlys98
    What are your opinions on The Vital Intelligence, the body vital scanning technology?
    submitted by /u/BossBossZR
    AI Dream 48 - Space Odyssey HAL 9000
    submitted by /u/LordPewPew777
    Researchers use artificial intelligence to help autonomous vehicles avoid idling at red lights.
    submitted by /u/qptbook
    Glass painting decor.
    submitted by /u/cookingandcraft
    This article showcases how to annotate semi-structured texts, whether PDFs or scanned images, using UBIAI’s annotation tool
    submitted by /u/UBIAI
    Yikes, the subbot has gone a little racist
    submitted by /u/orgeezuz
    Using Deep Learning for Programming Language Translation
    submitted by /u/VikasOjha666
    Google has found a new use case for large AI language models: Job interviews.
    submitted by /u/much_successes
    Purple Sunset (made with starryai)
    submitted by /u/Losthel
    Your AI Weekly Digest (newsletter)! I just shared a new iteration covering BlobGAN
    submitted by /u/OnlyProggingForFun
    Twitter Art Club
    submitted by /u/VIRUS-AOTOXIN
    AIRS in the AIR: Modular Self-reconfigurable Robot (Session 2) Speakers: Michael Rubenstein and Tin Lun Lam
    submitted by /u/nousetest
    Exploring the potential of MIDI sequence generation using Machine Learning - Survey
    I am a student currently studying Software Development at MCAST. For my degree thesis, I am exploring the potential of MIDI sequence generation using Machine Learning techniques. To evaluate the implemented algorithm, I created a questionnaire which asks respondents to rate 10 different samples. No personally identifiable information will be collected in this questionnaire. I would greatly appreciate it if you could spare around 5 minutes to take part in this questionnaire. https://www.survio.com/survey/d/Y1W7D8P1X3J7F8U7Y submitted by /u/drinu98
    9 Best Artificial Intelligence books for beginners to experts to read in 2022
    submitted by /u/maneesh123456
  • Open

    Logging in Python
    Logging is a way to store information about your script and track events that occur. When writing any complex script in Python, logging is essential for debugging software as you develop it. Without logging, finding the source of a problem in your code may be extremely time-consuming. After completing this tutorial, you will know: […]
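    As a taste of what the tutorial covers, a minimal sketch of the standard library's logging setup; the format string and messages are illustrative:

        import logging

        logging.basicConfig(
            level=logging.INFO,
            format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        )
        log = logging.getLogger(__name__)

        log.debug("invisible at INFO level")  # filtered out by the level setting
        log.info("script started")
        try:
            1 / 0
        except ZeroDivisionError:
            log.exception("something went wrong")  # logs the message plus the traceback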
  • Open

    DALL·E 2 Research Preview Update
    Early users have created over 3 million images to date and helped us improve our safety processes. We're excited to begin adding up to 1,000 new users from our waitlist each week.
  • Open

    How to concatenate feature vectors of different length?
    I have a very basic question. In an architecture like the one in the attached image, how do you concretely concatenate the feature vector coming from the conv nets and the attention with the feature vector coming from the scalar observation (direction, position, etc.)? One of them might be something like a 1x10000 vector, while the other two might be way shorter, say 1x4 for the direction and 1x2 for the position. https://arxiv.org/abs/2104.07750 submitted by /u/No_Possibility_7588
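    A minimal PyTorch sketch of the usual answer, with sizes mirroring the post's example; the projection layer at the end is an optional extra, not something the post specifies:

        import torch

        batch = 8
        conv_feat = torch.randn(batch, 10000)  # flattened conv/attention features
        direction = torch.randn(batch, 4)      # scalar observation: direction
        position = torch.randn(batch, 2)       # scalar observation: position

        # Concatenate along the feature dimension; only the batch dim must match
        fused = torch.cat([conv_feat, direction, position], dim=1)
        print(fused.shape)  # torch.Size([8, 10006])

        # Optionally project the short vectors up first so they aren't drowned out
        proj = torch.nn.Linear(4, 64)
        fused_alt = torch.cat([conv_feat, proj(direction), position], dim=1)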
    Generative Trajectory Modelling: a "complete shift" in the Reinforcement Learning paradigm.
    submitted by /u/moschles
    Installing & Using MuJoCo 2.1.5 with OpenAi Gym
    Hi all, I finally got my environment set up with MuJoCo and now I would like to use it through OpenAI Gym to train some agents. I was reading that before DeepMind took it over, the installation process was very annoying. Is it still that way, or is there an easier solution meanwhile? Following the instructions on https://github.com/openai/mujoco-py, just with the 2.1.5 version, I get an error when I try to import mujoco_py. Also, generally asked: is there a better alternative to OpenAI Gym to work with the newer MuJoCo versions? I'd be very thankful for any tips :) submitted by /u/disdisinform
    10 Best Deep Reinforcement Learning Courses
    submitted by /u/MlTut
    Replacing Model Based Predictive Control with RL, does anyone have any resources that I can look into for such an activity... for now I'm just thinking about using an active MPC for "training".
    submitted by /u/veezion123
    Double DQN algorithms converge on only one action.
    I have taken some reference implementations of the DDQN algorithm and am trying to create an agent which can trade in the forex market. Unfortunately, from the 2nd trial onwards (after training the DDQN for the first time), the probability distribution of the actions converges on only one action, and the loss and the reward fluctuate. Dataset - 13k Batch_size - 64 Update_rl - 6 Learning rate - 0.001 Gamma - 0.99 Reward - -1 to 1 (depending upon profit and loss) submitted by /u/laxuu
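    One sanity check worth running on any reference implementation is the target computation itself; a minimal sketch of the Double DQN target, where the network and tensor names are placeholders:

        import torch

        def ddqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
            # Online net selects the action, target net evaluates it (the "double" part)
            with torch.no_grad():
                next_actions = online_net(next_state).argmax(dim=1, keepdim=True)
                next_q = target_net(next_state).gather(1, next_actions).squeeze(1)
                return reward + gamma * (1.0 - done) * next_q

    If both the argmax and the evaluation come from the target network, the algorithm silently degrades to vanilla DQN; an epsilon that decays to zero too quickly is another common cause of a policy collapsing onto a single action.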
    How to Leverage Reinforcement Learning • Phil Winder & Rebecca Nugent
    submitted by /u/goto-con
    Are there any text-based game environments for RL agents to train on similar to Textworld or Jericho?
    submitted by /u/cz1xrnvz
  • Open

    Use Amazon Lex to capture street addresses
    Amazon Lex provides automatic speech recognition (ASR) and natural language understanding (NLU) technologies to transcribe user input, identify the nature of their request, and efficiently manage conversations. Lex lets you create sophisticated conversations, streamline your user experience to improve customer satisfaction (CSAT) scores, and increase containment in your contact centers. Natural, effective customer interactions require […]
  • Open

    Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps
    Wow, my blog “Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy” created quite a stir. And that was my intention. I actually believe that the Data Mesh is an important data management and governance framework (yes, the Data Mesh is more of a framework than a technology) for helping organizations deliver a business-driven…
  • Open

    Vector-Quantized Image Modeling with Improved VQGAN
    Posted by Jiahui Yu, Senior Research Scientist, and Jing Yu Koh, Research Software Engineer, Google Research In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora. This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer …
  • Open

    How to Make Oracle AIs Safe
    The Counterfactual Oracle AI …
  • Open

    Living better with algorithms
    Graduate student Sarah Cen explores the interplay between humans and artificial intelligence systems, to help build accountability and trust.

  • Open

    [N] Gradio Blocks + Hugging Face event is starting this week. A hackathon type event from May 17th to May 31st with prizes in which we will create interactive web demos for state-of-the-art machine learning models
    We are happy to invite you to the Gradio Blocks Party - a community event in which we will create interactive demos for state-of-the-art machine learning models. Demos are powerful because they allow anyone — not just ML engineers — to try out models in the browser, give feedback on predictions, and identify trustworthy models. The event will take place from May 17th to 31st. We will be organizing this event on Hugging Face: https://huggingface.co/Gradio-Blocks and the Hugging Face discord channel. Prizes will be given at the end of the event; see the Prizes section. We will be building demos using the new Gradio Blocks API. Blocks allows you to build web-based demos in a flexible way using the Gradio library. Gradio is a popular choice for building demos for machine learning models, as it allows you to create web-based UIs all in Python. For example, the original post includes a video of a UI for Dall-E Mini built with Gradio Blocks. submitted by /u/Illustrious_Row_9971
    Are there any research groups that try to find quantitative laws of deep learning? [D]
    I’m a graduate student in a deep learning application area. How model development works is that you have a recipe of tricks and guidelines you play around with and cross-validate different versions of until you find one that works well. Most research trying to improve accuracy on proxy datasets seems a waste of time to me, and I’m more interested in fundamental patterns in learning. There is this wealth of experiments the community is producing, but the theoretical insight is nothing more than a bunch of rules of thumb. I feel like it’s time to split experimental and theoretical deep learning apart, like experimental and theoretical physics co-exist, and to try to figure out quantitative laws of deep learning. I get that deep learning is a non-convex optimisation problem and that we can’t prove much, but what we can do is try to figure laws out. Newton's law of gravity was also not proven from some other axioms. While deep learning does not seem fully amenable to mathematical analysis, it is to physical analysis. I’m asking for some theories, not theorems. Anyone know of any work on this or work in progress? I’m not talking about physics-informed deep learning, DL applied to physics, or physical theory used to improve DL, just to be clear. Work I would find interesting would try to, from the wealth of experimental DL data, uncover fundamental, quantitative relationships between architecture, optimisation procedure, dataset and performance. submitted by /u/lemlo100
    [D]Practical correlation metric for a large number of vectors
    I am dealing with a time series consisting of input flow sampled every 5 minutes over 441 days. My aim is to find any possible correlation in data coming from: the same day of the week; the same moment in time (e.g. 14:35, 14:40, etc.). I proceeded to sample according to weekdays and hours. Then I computed the 63x63 correlation matrix for each of the weekdays and a 441x441 one for each hour, which in the second case is pretty impractical. I feel like the dataset is too broad to categorise with a simple yes-or-no question, yet that's what I've been tasked with. So my question is whether I can try autocorrelation and whether it yields the parameters p and q of an ARIMA model, or would you suggest another, more succinct approach that may give a broader picture of the data? submitted by /u/Ambitious-Donut1321
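    One succinct alternative to the large pairwise matrices is to test the suspected periods directly with autocorrelation at the daily and weekly lags; a minimal sketch, where the random series is only a stand-in for the real flow data:

        import numpy as np

        def autocorr(x, lag):
            # Sample autocorrelation of a 1-D series at a given lag
            x = x - x.mean()
            return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

        flow = np.random.rand(441 * 288)  # placeholder: 441 days of 5-minute samples
        print(autocorr(flow, 288))        # lag of one day (288 samples)
        print(autocorr(flow, 2016))       # lag of one week (7 * 288 samples)

    Strong peaks at these lags would answer the yes-or-no question directly, and they are also the periods a seasonal ARIMA fit would pick up.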
    [D] What are the best information retrieval methods today?
    Given the advent of powerful language models using transformers, what IR systems are most effective today? Would love to read recent papers and blogs about the same! submitted by /u/evilBotman
    [D] Is representative training data always a good thing?
    More like this at: https://www.evilaicartoons.com/ submitted by /u/HAIL-9000
    [D] Could you technically use continuous numeric labels instead of one-hot labels for multiclass classification?
    This would work almost like an embedding for label classes: every class corresponds to a vector of continuous numeric values. You could identify the correct class based on the nearest neighbouring label vector for a given inference. The only problem I see is how you would initialize the label embedding before training. One advantage I see is that you could add classes without retraining the network. submitted by /u/mrwafflezzz
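    A minimal PyTorch sketch of the idea as described; random unit vectors are just one possible initialization, which is exactly the open question in the post:

        import torch

        num_classes, embed_dim = 10, 16
        # Fixed class "codes"; random unit vectors are one simple (assumed) choice
        label_vectors = torch.nn.functional.normalize(
            torch.randn(num_classes, embed_dim), dim=1)

        def predict(outputs):
            # outputs: (batch, embed_dim) network regressions toward a label vector
            dists = torch.cdist(outputs, label_vectors)  # (batch, num_classes)
            return dists.argmin(dim=1)                   # nearest label vector wins

        # Adding a class later only appends a row to label_vectors; no new head weights
        print(predict(torch.randn(5, embed_dim)))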
    [D] Which software packages are used for evaluation of automatic summarization besides ROUGE?
    Is there any easily findable software besides ROUGE for this purpose? submitted by /u/Goldback_Gorilla
    [D] Understanding audio sampling function for a speech synthesis WGAN
    Hello, I've recently made a post on this subreddit asking for clarifications on a certain paper describing an implementation of WGAN-GP for speech synthesis from silent videos. Those answers were all really helpful in better understanding the learning process; however, more questions have cropped up as I began digging deeper. I'm currently attempting to train a hybrid model between the architectures described in these two papers, with a generator and objective function from the former and the critic and PASE blocks from the latter. After not seeing any signs of convergence after 12 hours of training (~20 epochs on 4 speakers from the GRID corpus, roughly 8000 seconds of data), I read further and found some info regarding the data sampling in both articles. The first one states: The audio clips …
    [D] Estimation of marginal likelihood - what's the SOTA
    Hi! I face a problem in which I should estimate the marginal likelihood. My differentiable model is of moderate dimensionality (circa 50) and I do have a decent sample from the posterior. The model is expensive to calculate/propagate and it's implemented in an AD framework. I checked some of the works in the field; many of them seem to be developed for cosmological data. I am aware of nested sampling, the subdomain methods of M. D. Weinberg, and Delaunay triangulation approaches. Still, as I am completely fresh to the field, I don't know what to expect w.r.t. the accuracy of the methods. What is your experience? Do you know of any decent comparison studies or other techniques worth exploring? submitted by /u/msusik
    [D] How to choose dimensions for latent space ?
    Hi, I want to do clustering on bioinformatics data. I have a matrix of 129 samples and 3000 features (OTUs). I was thinking, as a preprocessing step, of reducing the dimensions of this matrix to see if I can achieve better clustering. I thought of using autoencoders to get the latent representation of the data and then applying, probably, k-means and seeing the results. How can I choose the latent space dimensions? I have read some papers where they do the same but they don't explain how they choose the dimensions. submitted by /u/grisp98
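    In the absence of a principled rule, one pragmatic option is to sweep a few latent sizes and score the downstream clustering; a minimal sketch, where the layer sizes, the cluster count, and the silhouette criterion are all illustrative assumptions rather than recommendations from the post:

        import torch
        import torch.nn as nn
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        def encode(X, latent_dim, epochs=200, lr=1e-3):
            enc = nn.Sequential(nn.Linear(X.shape[1], 256), nn.ReLU(), nn.Linear(256, latent_dim))
            dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, X.shape[1]))
            opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
            for _ in range(epochs):                      # full-batch training; 129 samples is tiny
                opt.zero_grad()
                loss = ((dec(enc(X)) - X) ** 2).mean()   # reconstruction MSE
                loss.backward()
                opt.step()
            return enc(X).detach().numpy()

        X = torch.randn(129, 3000)  # placeholder for the 129 x 3000 OTU matrix
        for d in (2, 8, 32, 64):
            Z = encode(X, d)
            labels = KMeans(n_clusters=3, n_init=10).fit_predict(Z)
            print(d, silhouette_score(Z, labels))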
  • Open

    Ensemble Reinforcement Learning
    Could you point me to some papers that use ensembling techniques to improve the efficiency of RL algorithms, please? Is this an area that is actively researched? If so, what is the SOTA for it? submitted by /u/SirRantcelot
    How to teach agent not to take invalid actions?
    Hi, I'm working in an environment where there are 3 possible actions, but some of them are sometimes invalid/impossible, depending on the state. I am giving the key piece of information about the current state (binary 0/1) to the agent in the observation space (among other info), in hopes that it would figure out that when state=0 I cannot take action 1, but when state=1 I cannot take action 3. To do this, I have tried: giving a huge penalty upon using an invalid move and ending the episode; giving a huge penalty upon using an invalid move, not changing the state, and allowing the episode to continue. Neither of these seemed to have any effect, as the agent continued making invalid moves. Any advice or insight would be greatly appreciated! I'm pretty new to RL. Thank you! submitted by /u/VladimirB-98
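    A common alternative to penalties is action masking: make invalid actions unselectable instead of punished. A minimal PyTorch sketch, with illustrative mask values:

        import torch

        logits = torch.tensor([1.2, 0.3, -0.5])   # raw policy/Q outputs for 3 actions
        mask = torch.tensor([True, False, True])   # e.g. state=0 makes action 1 invalid

        masked = logits.masked_fill(~mask, float("-inf"))
        probs = torch.softmax(masked, dim=0)       # invalid actions get probability 0
        action = torch.distributions.Categorical(probs).sample()
        print(probs, action)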
    Simulating random RGB images and observation space for RL model
    Hi, I'm a bit confused about the usage of RGB images in RL. I'm trying to create random tensors (meant to imitate a visual observation and an observation space) to feed into my model and see whether it works correctly. What I don't understand is: concretely, how would I randomly initialize a tensor for the observation space (for the init method) and one for the observation (for the forward method)? Would this be a realistic observation, and how would I initialize the obs space? obs = torch.randn(4, 64, 64, 3) Second question: as far as I understand, the first dimension in PyTorch is the batch size. What does it represent exactly? submitted by /u/No_Possibility_7588
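    A minimal sketch of one common pattern, assuming the standard gym API; the 64x64x3 shape follows the post's example:

        import numpy as np
        import torch
        import gym

        obs_space = gym.spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8)
        obs = obs_space.sample()  # one random "frame" with shape (64, 64, 3)

        # For a PyTorch CNN: add a batch dim and move channels first
        batch = torch.as_tensor(obs, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)
        print(batch.shape)  # torch.Size([1, 3, 64, 64])

    The batch dimension simply counts how many observations the network processes in one forward pass; a single frame becomes a batch of one.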
    It is ok to roast me, but I need help about my project.
    I know that Connect 4 is solved, but I am trying to make value-based RL agents play Connect 4 with MARL. I trained both agents many times and the policy always converges to a very poor situation: they just commit to rushing four in a row in one of the columns, not trying to block the opponent, and the agent who goes first always wins after training for a long time. Here is my code. Please give me suggestions to improve. I know that MCTS is good for this case, but I haven't learned it yet; I'm currently focusing on value-based methods. Also, are eligibility traces dead? Note: I tried an ANN (flattening the state) and it converged to a stupid strategy faster; now I am using a CNN (and also get a stupid policy). submitted by /u/Professional_Card176
    Observation vector comprising only of previous action and reward: Isn't that a multi-armed bandits problem?
    Hello redditors of RL, I am doing joint research on RL and wireless comms, and I am observing a trend in a lot of the problem formulations people use there: sometimes, the observation vector of the "MDP" is defined as simply containing the past action and reward (usually without any additional information). Given that all algorithms collect experience tuples of (s, a, r, s'), would you agree with the following statements? (1) Assuming a discrete action space, if s_t contains only [a_{t-1}, r_{t-1}], isn't that the same as having no observations, since you already have this information in your experience tuple? (2) Taking it a step further, isn't that a multi-armed bandits scenario? I.e., assuming the stochastic process that generates the rewards is stationary, the optimal "policy" essentially always selects one action. This is not an MDP (or rather, it is only "trivially" an MDP), wouldn't you agree? (3) Even if s_t includes other information, isn't the incorporation of [a_{t-1}, r_{t-1}] simply unnecessary? (4) Assuming a continuous action space, couldn't this problem be treated similarly to the (discrete) multi-armed bandits problem, as long as you adopt a parametric model for learning the distributions of the rewards conditioned on the actions? submitted by /u/SomeParanoidAndroid
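    For reference, under the stationarity assumption in (2), the setting reduces to the classic stationary bandit that a simple epsilon-greedy learner already handles; a minimal sketch with arbitrary reward means:

        import numpy as np

        rng = np.random.default_rng(0)
        true_means = np.array([0.2, 0.5, 0.8])  # stationary reward distribution per arm
        Q = np.zeros(3)                          # value estimate per action
        N = np.zeros(3)                          # pull counts

        for t in range(10_000):
            a = rng.integers(3) if rng.random() < 0.1 else int(Q.argmax())
            r = rng.normal(true_means[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]            # incremental mean update

        print(int(Q.argmax()))  # settles on the single best arm, as argued above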
    Sustainability applications
    Is there any recent paper or work done applying RL in the field of sustainability and sustainable development? submitted by /u/blitzkreig3
  • Open

    Contextual Rephrasing in Google Assistant
    Posted by Aurelien Boffy, Senior Staff Software Engineer, and Roberto Pieraccini, Engineering Director, Google Assistant When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by…
  • Open

    Can artificial intelligence overcome the challenges of the health care system?
    MIT and Mass General Brigham researchers and physicians connect in person to bring AI into mainstream health care.
    On the road to cleaner, greener, and faster driving
    Researchers use artificial intelligence to help autonomous vehicles avoid idling at red lights.
  • Open

    How much money does one day of intensive use of OpenAI cost?
    submitted by /u/huberpaul
    Microsoft AI Team Introduces “Federated Learning Utilities and Tools for Experimentation” (FLUTE): A High-Performance Open-Source Platform For Federated Learning Research And Offline Simulations
    Distributed Training (DT), which focuses on scaling the model training process via model or data parallelism, has gotten much interest because of an increase in training datasets. On the other hand, DT makes some assumptions, particularly in terms of communication and network parameters. Furthermore, new data management restrictions are arising due to the growing requirement for personal data protection, making data more inaccessible due to storage behind firewalls or on users’ devices without the option of being shared for centralized training. Federated Learning is a decentralized machine learning approach that emphasizes collaborative training and data privacy for users. The central concept underlying federated learning is that these machine learning models are highly versatile when it comes to training sophisticated models over large amounts of data without having to share that data with a centralized body. Despite its popularity as a research topic, it is challenging to deploy since it differs significantly from typical machine learning pipelines. Local data variety, end-node hardware diversity, privacy concerns, and optimization limits are challenges in federated learning. Furthermore, federated learning applications frequently need to extend the learning process to millions of clients to imitate a real-world environment. These difficulties highlight the necessity for a simulation platform that allows researchers and developers to conduct proof-of-concept implementations and verify performance before creating and deploying their machine learning models. Paper: https://arxiv.org/pdf/2203.13789.pdf Github: https://github.com/microsoft/msrflute submitted by /u/No_Coffee_4638
    Overview of XGBoost and Gradient Boosting
    submitted by /u/aidev2040
    "I used AI to generate a music video from song lyrics" by [DoodleChos]
    submitted by /u/VegiHarry
    I've finally launched MagicNote - a new platform where you can write better in your second language with GPT-3
    Hello guys, I'm happy to announce I've finally launched MagicNote - a new platform where you can write better on demand for your work or business in your second language using GPT-3. https://magicnote.ai It was a wild ride getting here, but I finally made it after 10-11 months of hard work. The majority of my time was spent validating the idea, not building the product. I've done my best to make my AI writing assistant as affordable as possible – it's one of the cheapest on the market. For me, this is more about making a difference in people’s lives than making money. To be honest, I developed it primarily for myself and still use it. Communication with colleagues, clients, potential customers, or employees matters, especially when it happens in a foreign language. MagicNote helps me improve the quality of my communication with such people while writing on demand. I thought it could be beneficial to thousands of other people like me. Feel free to join the platform if you are one of them. Looking forward to your feedback. submitted by /u/data-gig [link] [comments]  ( 1 min )
    This video is part of my entry to the Future of Life Institute's Worldbuild AI project to envision a positive future with AGI in 2045. I'd love feedback!
    submitted by /u/Turil [link] [comments]  ( 1 min )
    [English Speakers] [Feedback] [Survey] [Populism] [AI] I would greatly appreciate your feedback on my survey
    Hello! For my Bachelor's thesis, I am examining the effects of populism on people's attitudes. Before I send out my survey for data collection, I would like to ask for some of your feedback in the hope of improving my questions and stimulus material. I would greatly appreciate it if you could take 5-7 minutes to fill out this survey and then let me know what you liked and disliked about the study. Thank you for your help! https://uvacommscience.eu.qualtrics.com/jfe/form/SV_3w9GzcvxQ7fWRzo submitted by /u/Psychological_Face69 [link] [comments]  ( 1 min )
    Javascript vs Python for variety of uses (AI, App Development, etc…)
    I currently know Java basics well (I have completed AP CS A), but the language doesn't completely suit my needs. I am currently in the summer before starting my bachelor's in Computer Science. Over the next few months, I would like to begin learning artificial intelligence, and I would like to first learn the syntax of the language I'll use. Along with AI, I'd like the freedom to program apps on both iOS and Android, and also maybe get into web development (maybe make basic websites where I can apply my knowledge in AI, create a personal portfolio, etc…). I would like the language I mainly use to be easily applicable to app and web development along with AI, so I can apply the things I learn in AI through personal projects. This is really important for me because as I was learning Java for school, I was extremely limited. For example, Java app development kept me stuck with Android, and I couldn't approach AI with Java. I'm leaning more towards JavaScript as it's a language I've never tried, but I am not completely sure which choice will suit me best. submitted by /u/obvslynot [link] [comments]  ( 2 min )
    26 Images Created by Dall-E 2
    submitted by /u/kbf_ [link] [comments]
    Nvidia Canvas is still really cool for anyone waiting for DALL E 2
    submitted by /u/BeginningRealistic49 [link] [comments]
    The shell plugin I wrote writes your git commands
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
  • Open

    Machine Learning with Harsh
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    Mission Made Possible: Real-Time Rendering Helps Studio Create Cinematic Battle Between Characters From ‘Diablo Immortal’
    Real-time rendering is helping one studio take virtual production to impossible heights. In their latest project, the creators at Los Angeles-based company Impossible Objects were tasked with depicting an epic battle between characters from the upcoming video game, Diablo Immortal. But the showdown had to take place on the surface of a Google Pixel phone, Read article > The post Mission Made Possible: Real-Time Rendering Helps Studio Create Cinematic Battle Between Characters From ‘Diablo Immortal’ appeared first on NVIDIA Blog.  ( 3 min )
    AI on the Ball: Startup Shoots Computer Vision to the Soccer Pitch
    Eyal Ben-Ari just took his first shot on a goal of bringing professional-class analytics to amateur soccer players. The CEO of startup Track160, in Tel Aviv, has seen his company’s AI-powered sports analytics software tested and used in the big leagues. Now he’s turning his attention to underserved amateurs in the clubs and community teams Read article > The post AI on the Ball: Startup Shoots Computer Vision to the Soccer Pitch appeared first on NVIDIA Blog.  ( 3 min )
    Concept Artist Pablo Muñoz Gómez Enlivens Fantasy Creatures ‘In the NVIDIA Studio’
    Concept artist Pablo Muñoz Gómez dives In the NVIDIA Studio this week, showcasing artwork that depicts a fantastical myth. Gómez, a creator based in Australia, is equally passionate about helping digital artists, teaching 3D classes and running the Zbrush guides website with his creative specialties: concept and character artistry. The post Concept Artist Pablo Muñoz Gómez Enlivens Fantasy Creatures ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Customize pronunciation using lexicons in Amazon Polly
    Amazon Polly is a text-to-speech service that uses advanced deep learning technologies to synthesize natural-sounding human speech. It is used in a variety of use cases, such as contact center systems, delivering conversational user experiences with human-like voices for automated real-time status check, automated account and billing inquiries, and by news agencies like The Washington […]  ( 7 min )
  • Open

    Chemical element abbreviation patterns
    I’ve wondered occasionally about the patterns in how chemical elements are abbreviated. If you don’t know the abbreviation for an element, is there a simple algorithm that would let you narrow the range of possibilities or improve your odds at guessing? Here’s a survey of how the elements are abbreviated. Latin and German The elements […] Chemical element abbreviation patterns first appeared on John D. Cook.  ( 3 min )
  • Open

    How to Use an AI Story Generator to Write Your Stories
    PS: This entire article was written by an AI story generator: Jasper AI.  ( 4 min )
  • Open

    AI Harms are Societal, Not Just Individual
    Contents: Not just Individual, but Societal Harms · Case Study: Privacy and surveillance · Case Study: Disinformation and erosion of trust · Individual Harms, Individual Solutions · Parallels with Environmental Harms · Directions Forward. When the US government switched to the facial identification service ID.me for unemployment benefits, the software failed to recognize Bill Baine's face. While the app said that he could have a virtual appointment to be verified instead, he was unable to get through. The screen showed a wait time of 2 hours and 47 minutes that never updated, even over the course of weeks. He tried calling various offices, his daughter drove in from out of town to spend a day helping him, and yet he was never able to get a useful human answer on what he was su…  ( 7 min )

  • Open

    [D] How to nominate NeurIPS 2022 reviewer?
    I think last year they had a link to nominate/self-nominate reviewers, but I couldn't find one this year. Do any of you know how to do it? submitted by /u/Deep_forgetting [link] [comments]
    [D] Anyone still using Stochastic Depth?
    I recently studied ways to improve the training time of big neural networks, especially ResNets. Along the way, I could not help but notice the big claims in the paper Deep Networks with Stochastic Depth. To summarize informally, their contribution is a new hyperparameter for ResBlocks, which is used to skip the inner part of the residual connection with a given probability (they use 0.5 in their experiments). Quoting from the paper: Let b ∈ {0, 1} denote a Bernoulli random variable [...] If b = 1, eq. (2) reduces to the original ResNet update and this ResBlock remains unchanged. If b = 0, the ResBlock reduces to the identity function. They not only claim a big improvement in training time. Following the calculations above, approximately 25% of training time …  ( 2 min )
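    For concreteness, here is a minimal PyTorch sketch of the rule the post quotes. The two-convolution residual body and channel count are illustrative assumptions; the test-time scaling of the residual branch by the survival probability follows the paper's formulation:

    ```python
    # Sketch of a ResBlock with stochastic depth: during training, the inner
    # residual branch is skipped with probability 1 - survival_prob (b = 0).
    import torch
    import torch.nn as nn

    class StochasticDepthBlock(nn.Module):
        def __init__(self, channels, survival_prob=0.5):
            super().__init__()
            self.survival_prob = survival_prob
            self.residual = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        def forward(self, x):
            if self.training:
                if torch.rand(1).item() < self.survival_prob:  # b = 1: keep block
                    return x + self.residual(x)
                return x                                       # b = 0: identity
            # At test time, scale the residual by its survival probability.
            return x + self.survival_prob * self.residual(x)
    ```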
    [R] Survey on Misuse of NLP Research
    We at CopeNLU and the Digital Democracies Institute are currently running an online survey on the potential harms and misuses of Natural Language Processing technologies and research. We, therefore, ask researchers in the field of natural language processing to fill out the following survey to give us an insight into their concerns. We would really appreciate it if you could take a few minutes to fill out the survey. The survey takes about 20 minutes to complete and is available here: copenlu.limesurvey.net/987789 submitted by /u/frimelle [link] [comments]  ( 1 min )
    [P] A quick tip on DataFrame.apply
    I have been using pandas for years in many projects, but I still feel pain when applying a function to multiple columns. So I recently developed a tool for our project, making life easier for data scientists like me. The pandas.DataFrame we use every day:

    ```python
    import pandas as pd
    import towhee

    # create a dataframe
    df = pd.DataFrame({'a': range(5)})
    #    a
    # 0  0
    # 1  1
    # 2  2
    # 3  3
    # 4  4
    ```

    By wrapping the pandas.DataFrame with Towhee, we get runas_op as an alternative to DataFrame.apply:

    ```python
    # a data collection is a wrapper for a dataframe
    dc = towhee.from_df(df)
    dc.runas_op['a', 'b'](func=lambda x: x+1)
    #    a  b
    # 0  0  1
    # 1  1  2
    # 2  2  3
    # 3  3  4
    # 4  4  5
    ```

    runas_op['a', 'b'](func=...) tells Towhee to use column a as input and column b as output. For functions with multiple inputs, we can use a tuple to specify the input columns:

    ```python
    dc.runas_op[('a', 'b'), 'c'](func=lambda x, y: x + y)
    #    a  b  c
    # 0  0  1  1
    # 1  1  2  3
    # 2  2  3  5
    # 3  3  4  7
    # 4  4  5  9
    ```

    We can also use a tuple for multiple outputs:

    ```python
    # apply a multiple-output function
    dc.runas_op['c', ('d', 'e')](func=lambda x: (x+1, x-1))
    #    a  b  c  d   e
    # 0  0  1  1  2   0
    # 1  1  2  3  4   2
    # 2  2  3  5  6   4
    # 3  3  4  7  8   6
    # 4  4  5  9  10  8
    ```

    Towhee provides a method-chaining style API, making the code easy to follow:

    ```python
    df = pd.DataFrame({'a': range(5)})
    dc = towhee.from_df(df) \
        .runas_op['a', 'b'](func=lambda x: x+1) \
        .runas_op[('a', 'b'), 'c'](func=lambda x, y: x + y) \
        .runas_op['c', ('d', 'e')](func=lambda x: (x+1, x-1))
    ```

    To convert the data back to a pandas.DataFrame:

    ```python
    new_df = dc.df
    type(new_df)
    # pandas.core.frame.DataFrame
    ```

    The project's homepage is https://github.com/towhee-io/towhee, and you can find more about Towhee by going through the documents. Would appreciate some feedback and contribution 🙂 submitted by /u/ok-reiase [link] [comments]  ( 2 min )
    [D] Colab GPU Assignment Algorithm
    Does anyone have a sense of how Colab determines which GPU you get? I have a Pro+ subscription, and for a period of a month or two I was regularly getting A100s. As a result I wrote some software that requires 40GB of memory. Now I can't seem to ever get an A100. Why can't Colab give any information on this issue? Additionally, I created a new Colab account hoping to get the 'new user' bump. After 24 hours on an A100, the new account can't even get a V100. submitted by /u/anon135797531 [link] [comments]  ( 1 min )
    [D] ML Datasets with instance interactions
    Hello, I'm looking for datasets that contain the interaction label of 2 instances, in order to test a DL model. Any type of dataset that has lines like (instance_1, instance_2, interaction_label) would be a great fit! I was thinking that I could also "handcraft" such datasets from a classification dataset, where I set the interaction of 2 instances from the same class to 1, and the interaction of 2 instances from different classes to 0. However, I'd like to discover dedicated datasets with interactions instead. If you are aware of such datasets, could you please point me to them? Thank you in advance! submitted by /u/Yuu_Aky [link] [comments]  ( 1 min )
    [N] TorchRL: PyTorch pre-release RL library is here!
    PyTorch ecosystem team has opensourced TorchRL, the RL dedicated PyTorch library. It's still WIP and it hasn't been officially released yet, but it's already good enough to be used in common research settings, including online / offline, on-policy / off-policy, meta-RL and such. It is quite efficient for a series of tasks: for model ensembling and meta-RL it leverages functorch's capabilities. Some functions are highly optimized to efficiently run on cuda (e.g. TD(lambda) returns). Examples currently include SAC, DDPG, PPO, REDQ and DQN. Let us know what you think of it, issues and PRs are welcome! Buy it, use it, break it, fix it... Doc and tutorials to come soon! submitted by /u/AdCool8270 [link] [comments]  ( 1 min )
    [D] Why do top speech/audio conferences like ICASSP and Interspeech have very high acceptance rates like 46%-48% ?
    I have heard from my fellow Ph.D students and post-docs that the lower the acceptance rate of conferences, the higher their reputation. I see that this is true for Neurips, ICLR, ICML, ACL etc. But ICASSP and Interspeech are considered top conferences in speech/audio applications. So why are their acceptance rates so high? submitted by /u/Far_Conversation_445 [link] [comments]  ( 3 min )
    [D] Using U-Net for 1D segmentation from 2D images ?
    Hi all! I am working on building a scanner to digitize film strips using a rectangular area sensor. This means I am lacking the entire context of the film during capture, but I still need to align the images properly to the sensor so the resolution can be maximized with automatic capture and film alignment. The issue at hand is then to segment the captured images between "image on the film" and "gap between images" areas. I have tried using traditional CV methods, but issues arise with poorly exposed films, since images can become very clear, pretty much as clear as the gaps between images, with only some sparse image elements visible. This is why I believe that the full context of the 2D image is necessary for segmentation. Since the captured images always have the same orientation, and gaps between images are always vertical, I believe that a 1D output would be sufficient, e.g. "iiiigggiiiiiii" where "g" is a gap and "i" is the image area. The segmented areas should *always* span the full height of the image and have vertical borders if the output of the segmentation is a 2D image, hence why I believe that a 1D output is enough. Is this something that can be done with a U-Net? I am very much a beginner with this, but seeing how it can be used for outputting 2Dx1 mono images from 2Dx3 color inputs, I was wondering if this could be done. Thanks a lot in advance! submitted by /u/iAmTheAlchemist [link] [comments]  ( 2 min )
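    One way to get a 1D output from a 2D input, sketched below under assumed shapes: a small convolutional encoder followed by pooling over the height axis, so each pixel column receives a single image/gap logit. A full U-Net encoder-decoder could replace the toy feature extractor; the class name and layer sizes are illustrative, not from the post:

    ```python
    # Minimal sketch: a CNN that consumes a 2D grayscale capture and emits
    # one image/gap logit per pixel column by pooling over the height axis.
    import torch
    import torch.nn as nn

    class ColumnSegmenter(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            # Collapse the height axis so each column gets a single logit.
            self.head = nn.Conv1d(32, 1, kernel_size=1)

        def forward(self, x):               # x: (batch, 1, H, W)
            f = self.features(x)            # (batch, 32, H, W)
            f = f.mean(dim=2)               # pool over height -> (batch, 32, W)
            return self.head(f).squeeze(1)  # (batch, W) per-column logits

    model = ColumnSegmenter()
    logits = model(torch.randn(2, 1, 128, 512))  # two fake captures
    print(logits.shape)  # torch.Size([2, 512])
    ```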
    [D] Is it possible to submit a pure math paper to NeurIPS?
    I have some theoretical results on a popular algorithm in ML (score-based generative models). But I am from a maths background. All I do is state these results and prove them. The proofs are not extremely deep or involved, but they are rigorous (so maybe a bit pedantic) and would require the reviewer to actually know some stochastic analysis. I do some small numerical experiments on toy data sets in 2 dimensions to illustrate the results. Are there any tips on how I could maximize my chances of such a paper getting published in NeurIPS? Or are my chances very low? submitted by /u/future_gcp_poweruser [link] [comments]  ( 3 min )
    [D] Using Inception and FID scores in training?
    Is it possible to use the Inception and FID scores in the training of a deep image generation model, i.e. to maximize the scores in a loss function, albeit this is "cheating"? If so, has anyone / any paper done it? Thanks for any pointer. submitted by /u/thanrl [link] [comments]  ( 1 min )
    [News] New Google tech - Geospatial API uses computer vision and machine learning to turn 15 years of street view imagery into a 3d canvas for augmented reality developers
    submitted by /u/imaginfinity [link] [comments]  ( 4 min )
  • Open

    Batch size in Spinning Up's PPO?
    Proximal Policy Optimization — Spinning Up documentation (openai.com) I'm trying to understand the code in SpinningUp's implementation of PPO. It seems that when training the policy and value function networks, they used a batch size of 4000, instead of a minibatch. Does anyone know why? Wouldn't a minibatch behave better? submitted by /u/Traditional-Brother9 [link] [comments]  ( 1 min )
    Entity Gym: A new entity based API for reinforcement learning environments
    submitted by /u/Programmierer [link] [comments]  ( 1 min )
    Can n-step learning deal with sparse rewards?
    submitted by /u/Professional_Card176 [link] [comments]
    "Emergent bartering behaviour in multi-agent reinforcement learning", Johanson et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do you choose the size of your observation space when you create a custom environment?
    Perhaps naive question, but this is the first time that I create a custom env: how do you choose the size of, for example, the RGB image you're passing as an observation to the agent? It could be a 24x24x3 image, or it could be a 200x200x3 one, so based on what principle should I choose it? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    PPO - Log Std as trainable parameter?
    In many implementations of the PPO algorithm, I see the std of the policy distribution implemented as a learnable parameter. For example here: https://intellabs.github.io/coach/components/agents/policy_optimization/ppo.html However, it seems that this parameter does not change very much during the course of training, independently of what its start value was. I would expect it to converge to some common value no matter how I set it initially. Otherwise it is just dependent on its initial value, which I find very hard to tune. How do you implement the std of the policy distribution? And how do you set its initial value? I would be glad for any recommendations... https://preview.redd.it/xhss11tuquz81.png?width=354&format=png&auto=webp&s=1a9b0f9b4240ca61fd05fb130bd0ccc5da85f8b9 submitted by /u/flxh13 [link] [comments]  ( 1 min )
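    For reference, a minimal sketch of the pattern the post describes, with a state-independent log-std as a learnable parameter; the network sizes and initial value are illustrative assumptions:

    ```python
    # Diagonal-Gaussian policy with log-std as a learnable parameter,
    # updated by the optimizer alongside the network weights.
    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, obs_dim, act_dim, init_log_std=-0.5):
            super().__init__()
            self.mu_net = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
            )
            # One log-std per action dimension, shared across all states.
            self.log_std = nn.Parameter(init_log_std * torch.ones(act_dim))

        def forward(self, obs):
            mu = self.mu_net(obs)
            return torch.distributions.Normal(mu, self.log_std.exp())

    policy = GaussianPolicy(obs_dim=8, act_dim=2)
    dist = policy(torch.randn(4, 8))
    actions = dist.sample()                      # (4, 2)
    log_probs = dist.log_prob(actions).sum(-1)   # per-sample log-likelihood
    ```

    One possible reading of the behavior described above is that the gradient reaches log_std only through the log-probability and entropy terms of the PPO objective, so its updates tend to be small relative to those of the mean network; some implementations anneal the std on a fixed schedule instead of learning it.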
    n-step TD vs λ-return? (performance)
    submitted by /u/Professional_Card176 [link] [comments]
    What's the point of the actor in actor critic
    Since the actor seems to be dictated by the critic, I'm not quite sure what the point of the actor is. Of course the actor acts out the action, but could you not just let the critic decide the action, for example by taking the highest Q-value, or by converting the Q-values to probabilities if you want a stochastic approach? submitted by /u/Jobdriaan [link] [comments]  ( 2 min )
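    A quick illustration of the trade-off raised in the post above, with made-up numbers: a critic alone can indeed induce a policy over a small discrete action set, but for continuous actions the maximization over actions is itself an optimization problem, which is what the actor amortizes:

    ```python
    # Deriving a stochastic policy directly from a critic (softmax over
    # Q-values), as the post suggests; the Q-values are made-up.
    import numpy as np

    q_values = np.array([1.2, 0.3, -0.5])   # critic's Q(s, a) for 3 actions
    probs = np.exp(q_values)                # softmax over Q-values
    probs /= probs.sum()
    action = np.random.choice(len(q_values), p=probs)

    # For a continuous action a in R^n there is no finite list of Q(s, a)
    # values to enumerate: computing argmax_a Q(s, a) is an inner
    # optimization at every step, which an actor network learns to amortize.
    ```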
    Training a RL model to fit a curve
    I'm attempting to train a RL model to fit a curve and would appreciate your input to improve the performance. TL;DR: If anyone has an open source environment that does something similar to curve fitting, I would appreciate having a look at it! My setup is as follows: We have a true curve y_true, evaluated at regular intervals on some domain x. We also have a method to generate a simulated curve y_sim on the same domain x. The method for simulating the curve depends on a set of parameters. The action space is a box of the same dimension as the number of parameters of the curve simulation method. The parameters are then updated with each action as: params += 0.01*action (the actions range between -1 and 1, so 0.01 makes it such that the actions don't modify the parameters too drastically …  ( 3 min )
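    For readers who want to reproduce the setup, a minimal gym-style environment sketch matching the description above; the class name, reward choice (negative mean squared residual), and quadratic example are illustrative assumptions, not the poster's code:

    ```python
    # Curve-fitting environment: the agent nudges simulation parameters and
    # is rewarded for reducing the residual against the true curve.
    import numpy as np

    class CurveFitEnv:
        def __init__(self, x, y_true, simulate, n_params, max_steps=200):
            self.x, self.y_true = x, y_true
            self.simulate = simulate        # simulate(params, x) -> y_sim
            self.n_params = n_params
            self.max_steps = max_steps

        def reset(self):
            self.params = np.zeros(self.n_params)
            self.steps = 0
            return self._obs()

        def _obs(self):
            # Observation: residual between simulated and true curve.
            return self.simulate(self.params, self.x) - self.y_true

        def step(self, action):             # action in [-1, 1]^n_params
            self.params += 0.01 * np.clip(action, -1.0, 1.0)
            self.steps += 1
            reward = -np.mean(self._obs() ** 2)   # closer fit, higher reward
            done = self.steps >= self.max_steps
            return self._obs(), reward, done, {}

    # Example with a quadratic family: params are (a, b, c).
    x = np.linspace(-1, 1, 50)
    env = CurveFitEnv(x, 2.0 * x**2,
                      simulate=lambda p, x: p[0] * x**2 + p[1] * x + p[2],
                      n_params=3)
    obs = env.reset()
    obs, reward, done, info = env.step(np.random.uniform(-1, 1, 3))
    ```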
    Are the unsupervised RL experiments carried out correctly?
    Hi. I recently studied unsupervised RL, which pre-trains agents with task-agnostic rewards and then fine-tunes on downstream tasks. But everything I saw on URLB (the Unsupervised RL Benchmark) was worse than training from scratch. I didn't check all cases because of limited resources, so I wonder about your experience. I can't get a clear answer from the repository owner or the authors. walker run and stand, state-based, action repeat 1, pretrain 2M, batch size 1024, finetune batch size 256: scratch >> apt-icm. walker run and stand, pixel-based, action repeat 2, pretrain 2M, batch size 1024, finetune batch size 256: scratch > apt-icm. Thank you for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
  • Open

    Lessons From Deploying Deep Learning To Production
    submitted by /u/regalalgorithm [link] [comments]
    Last Week in AI: Enzyme developed with AI to decompose plastic, Hugging Face reaches $2 billion valuation, new ambitious EU AI Act, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    Breakthrough Google Deepmind Gato General AI Does 600+ Tasks | AI Robot Arm To Disarm Bombs
    submitted by /u/SlightSituation [link] [comments]
    AI Evolution: The Historic Timeline of AI Milestones
    Most folks think Artificial Intelligence (AI) is a novel notion, although it's been around for a long time. We went back in history and curated a list of all key artificial intelligence breakthroughs that have enabled us to live our current lives. Read more submitted by /u/ridamughal110 [link] [comments]
    AI Accelerators - Hardware For Artificial Intelligence
    CPUs were not as powerful and efficient a few decades ago when it came to running large computations for machine learning. Hardware manufacturers have worked hard to create a processing unit capable of performing any AI operation. Read more submitted by /u/ridamughal110 [link] [comments]
    Inworld AI's Ilya Gelfenbeyn and Kylan Gibbs talked about the tech behind their new developer platform for building AI-driven virtual characters and shared how to integrate their solution into games.
    submitted by /u/80lv [link] [comments]  ( 1 min )
    Can large language models be democratized?
    submitted by /u/bendee983 [link] [comments]
    Is it possible to program AI to detect cringe?
    submitted by /u/SimonCZ2 [link] [comments]
    A short story written by the AI - " Henry The Mighty Time Traveler "
    submitted by /u/Alive_Ad_2882 [link] [comments]
    Is Gato Really the Future of AI?
    submitted by /u/bartturner [link] [comments]
    AI poetry in rhyme proving difficult
    Hi, has anyone managed to get an AI to produce poetry in rhyme? I have tried to teach GPT-3 and Jurassic-1 to rhyme in ABAB style, after giving them around 50 example verses. Here is my prompt: "Human: You are AI, a brilliant AI poet. You write deep, beautiful and meaningful poetry in ABAB style. That means that in each 4 line verse, the last word in line 1 rhymes with the last word in line 3, and the last word in line 2 rhymes with the last word in line 4. Here are some examples of ABAB poems. Study how they are structured. Afterwards I will ask you to write some ABAB poems like these:" Then I gave all the examples. Then I ended with: "Human: That was all the examples. Please write some nice poetry in ABAB style like those poems. Each verse has to be 4 lines. AI: " I tried various tweaks to the prompt, but it really struggles to comprehend rhyme. I have seen perhaps two lines rhyming here and there, but it might be chance. I am using the largest models. Has anyone had any luck getting the models to rhyme? submitted by /u/Dinosaur-Owl [link] [comments]  ( 1 min )
    AI Implementation roadmap
    Take a look at an AI implementation roadmap with real case examples https://www.toolbox.com/tech/artificial-intelligence/guest-article/ai-implementation-what-does-it-take-to-adopt-artificial-intelligence-in-business/#SocialMediaInterests submitted by /u/lklimusheuskaja [link] [comments]
    Craft Album.
    submitted by /u/cookingandcraft [link] [comments]
    How to Leverage Conversational Chatbots to 10x Your E-commerce Sales?
    submitted by /u/mihircontra20 [link] [comments]
    Introduction to OpenCV and Image Processing with Python
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Google maps immersive view - uses AI and computer vision to fuse billions of images with real-time traffic and weather, creating a 3d simulation of the world that shows you the vibe of a place
    submitted by /u/imaginfinity [link] [comments]  ( 2 min )
    Cambridge AI Researchers Propose ‘MAGIC’: A Training-Free Framework That Plugs Visual Controls Into The Generation Of A Language Model
    The release of the Generative Pretrained Transformer (GPT-2) drew huge attention to generative language models (LMs), which are pre-trained on massive amounts of unstructured text and have produced strong results on a variety of NLP applications. LMs can continually produce text from a textual prompt via next-token prediction decoding. Models such as CLIP and ALIGN, pre-trained image-text joint embedding approaches, have revived multimodal representation learning of text and images. However, it remains challenging to integrate the benefits of pre-trained LMs and image-text embedding models to generate visually grounded text. Traditional approaches are generally limited by object detectors trained with a fixed set of labels. Currently, the ZeroCap approach is ut…  ( 2 min )
    Are There Any Good Entirely Free Text-to-Image AI Generators that have APIs?
    I have used wombo and it works really well, but there is no API for it. Are there any other Free Text-to-Image generators that have an API? submitted by /u/xbftw [link] [comments]
  • Open

    BREAKTHROUGH Google Deepmind Gato General AI Does 600+ Tasks With One Transformer Neural Network
    submitted by /u/getrich_or_diemining [link] [comments]
    Social Media networks as aggregate neural networks to explain political polarization and radicalization?
    With recent USA tragedies, I find myself wondering whether the advancing science of neural networks, machine learning, deep learning, etc. could be used to model or explain how social networks function as a sort of aggregate neural network: one that, like a classical neural network, "learns" and "trains", but that can also gain new nodes and is "trained" to conform to certain pathways that literally breed extremist views and polarization. I searched on JSTOR, etc., but was wondering if anyone in the community knew someone or some research going on in this vein? submitted by /u/1nvent [link] [comments]  ( 1 min )
    Suggestion Regarding ML(Product Selector/Recommender) task
    Hello all, I am a student working on a B2B project called “Digital Product Selector based on questions”. For example, if you go to a health care company’s website and want to find out which of their products suits you best, you would answer a set of questions on the website, and based on your answers it would recommend a product or products of that specific company. We have a static, rule-based algorithm that works fine when there are few products and few questions to answer for a specific company. However, for a company with a huge product list and more than 20 sets of questions, the algorithm takes a significant amount of time to produce recommended products, since it is rule-based and bogged down in calculations. Now, I want to replace the rule-based/static algorithm with a machine learning algorithm that I can train using my data. Please answer the following. 1) Can you recommend any pre-trained neural networks that I can use to address this use case? 2) Also, which problem statement does this use case belong to? 3) We have recently started to work with AWS; if there are any AWS services available to address this use case, please recommend them. submitted by /u/mubashir_ali93 [link] [comments]  ( 1 min )
    Breaking into the black box of artificial intelligence
    submitted by /u/nickb [link] [comments]
  • Open

    Mental secure hash function
    A few years ago I wrote about Manuel Blum’s proposed method for mentally computing a secure hash function. He proposed using this method as a password manager, using the hash of a web site’s name as the password for the site. I first wrote about Blum’s method on the Heidelberg Laureate Forum blog, then wrote […] Mental secure hash function first appeared on John D. Cook.  ( 5 min )
  • Open

    Personalize your machine translation results by using fuzzy matching with Amazon Translate
    A person’s vernacular is part of the characteristics that make them unique. There are often countless different ways to express one specific idea. When a firm communicates with their customers, it’s critical that the message is delivered in a way that best represents the information they’re trying to convey. This becomes even more important when […]  ( 8 min )
  • Open

    FLUTE: A scalable federated learning simulation platform
    Federated learning has become a major area of machine learning (ML) research in recent years due to its versatility in training complex models over massive amounts of data without the need to share that data with a centralized entity. However, despite this flexibility and the amount of research already conducted, it’s difficult to implement due […] The post FLUTE: A scalable federated learning simulation platform appeared first on Microsoft Research.  ( 6 min )
  • Open

    A robust approach for deep neural networks in presence of label noise: relabelling and filtering instances during training. (arXiv:2109.03748v2 [cs.LG] UPDATED)
    Deep learning has outperformed other machine learning algorithms in a variety of tasks, and as a result, it is widely used. However, like other machine learning algorithms, deep learning, and convolutional neural networks (CNNs) in particular, perform worse when the data sets contain label noise. Therefore, it is important to develop algorithms that help the training of deep networks and their generalization to noise-free test sets. In this paper, we propose a robust training strategy against label noise, called RAFNI, that can be used with any CNN. This algorithm filters and relabels instances of the training set based on the predictions and their probabilities made by the backbone neural network during the training process. That way, this algorithm improves the generalization ability of the CNN on its own. RAFNI consists of three mechanisms: two mechanisms that filter instances and one mechanism that relabels instances. In addition, it does not assume that the noise rate is known, nor does it need to be estimated. We evaluated our algorithm using different data sets of several sizes and characteristics. We also compared it with state-of-the-art models using the CIFAR10 and CIFAR100 benchmarks under different types and rates of label noise and found that RAFNI achieves better results in most cases.
    Automatic Monitoring of Fruit Ripening Rooms by UHF RFID Sensor Network and Machine Learning. (arXiv:2204.12415v2 [eess.SY] UPDATED)
    Accelerated ripening through the exposure of fruits to controlled environmental conditions and gases is nowadays one of the most assessed food technologies, especially for climacteric and exotic products. However, a fine granularity control of the process and consequently of the quality of the goods is still missing, so the management of the ripening rooms is mainly based on qualitative estimations only. Following the modern paradigms of Industry 4.0, this contribution proposes a non-destructive RFID-based system for the automatic evaluation of the live ripening of avocados. The system, coupled with a properly trained automatic classification algorithm based on Support Vector Machines (SVMs), can discriminate the stage of ripening with an accuracy greater than 85%.
    Intrinsically Motivated Self-supervised Learning in Reinforcement Learning. (arXiv:2106.13970v2 [cs.LG] UPDATED)
    In vision-based reinforcement learning (RL) tasks, it is prevalent to assign auxiliary tasks with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in auxiliary tasks, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
    PerfectDou: Dominating DouDizhu with Perfect Information Distillation. (arXiv:2203.16406v4 [cs.AI] UPDATED)
    As a challenging multi-player card game, DouDizhu has recently drawn much attention for analyzing competition and collaboration in imperfect-information games. In this paper, we propose PerfectDou, a state-of-the-art DouDizhu AI system that dominates the game, in an actor-critic framework with a proposed technique named perfect information distillation. In detail, we adopt a perfect-training-imperfect-execution framework that allows the agents to utilize the global information to guide the training of the policies as if it is a perfect information game and the trained policies can be used to play the imperfect information game during the actual gameplay. To this end, we characterize card and game features for DouDizhu to represent the perfect and imperfect information. To train our system, we adopt proximal policy optimization with generalized advantage estimation in a parallel training paradigm. In experiments we show how and why PerfectDou beats all existing AI programs, and achieves state-of-the-art performance.
    Accelerating Part-Scale Simulation in Liquid Metal Jet Additive Manufacturing via Operator Learning. (arXiv:2202.03665v1 [physics.flu-dyn] CROSS LISTED)
    Predicting part quality for additive manufacturing (AM) processes requires high-fidelity numerical simulation of partial differential equations (PDEs) governing process multiphysics on a scale of minimum manufacturable features. This makes part-scale predictions computationally demanding, especially when they require many small-scale simulations. We consider drop-on-demand liquid metal jetting (LMJ) as an illustrative example of such computational complexity. A model describing droplet coalescence for LMJ may include coupled incompressible fluid flow, heat transfer, and phase change equations. Numerically solving these equations becomes prohibitively expensive when simulating the build process for a full part consisting of thousands to millions of droplets. Reduced-order models (ROMs) based on neural networks (NN) or k-nearest neighbor (kNN) algorithms have been built to replace the original physics-based solver and are computationally tractable for part-level simulations. However, their quick inference capabilities often come at the expense of accuracy, robustness, and generalizability. We apply an operator learning (OL) approach to learn a mapping between initial and final states of the droplet coalescence process for enabling rapid and accurate part-scale build simulation. Preliminary results suggest that OL requires order-of-magnitude fewer data points than a kNN approach and is generalizable beyond the training set while achieving similar prediction error.
    Emotion Intensity and its Control for Emotional Voice Conversion. (arXiv:2201.03967v2 [cs.SD] UPDATED)
    Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
    Uncertify: Attacks Against Neural Network Certification. (arXiv:2108.11299v3 [cs.LG] UPDATED)
    A key concept towards reliable, robust, and safe AI systems is the idea to implement fallback strategies when predictions of the AI cannot be trusted. Certifiers for neural networks have made great progress towards provable robustness guarantees against evasion attacks using adversarial examples. These methods guarantee for some predictions that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions without guarantees, the method abstains from making a prediction and a fallback strategy needs to be invoked, which is typically more costly, less accurate, or even involves a human operator. While this is a key concept towards safe and secure AI, we show for the first time that this strategy comes with its own security risks, as such fallback strategies can be deliberately triggered by an adversary. In particular, we conduct the first systematic analysis of training-time attacks against certifiers in practical application pipelines, identifying new threat vectors that can be exploited to degrade the overall system. Using these insights, we design two backdoor attacks against network certifiers, which can drastically reduce certified robustness. For example, adding 1% poisoned data during training is sufficient to reduce certified robustness by up to 95 percentage points, effectively rendering the certifier useless. We analyze how such novel attacks can compromise the overall system's integrity or availability. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound. (arXiv:2201.02639v4 [cs.CV] UPDATED)
    As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
    Neurochaos Feature Transformation and Classification for Imbalanced Learning. (arXiv:2205.06742v1 [cs.NE])
    Learning from limited and imbalanced data is a challenging problem in the Artificial Intelligence community. Real-time scenarios demand decision-making from rare events wherein the data are typically imbalanced. These situations commonly arise in medical applications, cybersecurity, catastrophic predictions etc. This motivates development of learning algorithms capable of learning from imbalanced data. Human brain effortlessly learns from imbalanced data. Inspired by the chaotic neuronal firing in the human brain, a novel learning algorithm namely \emph{Neurochaos Learning} (NL) was recently proposed. NL is categorized in three blocks: Feature Transformation, Neurochaos Feature Extraction (CFX), and Classification. In this work, the efficacy of neurochaos feature transformation and extraction for classification in imbalanced learning is studied. We propose a unique combination of neurochaos based feature transformation and extraction with traditional ML algorithms. The explored datasets in this study revolve around medical diagnosis, banknote fraud detection, environmental applications and spoken-digit classification. In this study, experiments are performed in both high and low training sample regime. In the former, five out of nine datasets have shown a performance boost in terms of macro F1-score after using CFX features. The highest performance boost obtained is $\textbf{25.97\%}$ for {\it Statlog (Heart)} dataset using CFX+Decision Tree. In the low training sample regime (from just one to nine training samples per class), the highest performance boost of $\textbf{144.38\%}$ is obtained for {\it Haberman's Survival} dataset using CFX+Random Forest. NL offers enormous flexibility of combining CFX with any ML classifier to boost its performance, especially for learning tasks with limited and imbalanced data.
    Scaling the weight parameters in Markov logic networks and relational logistic regression models. (arXiv:2103.15140v2 [cs.AI] UPDATED)
    We consider Markov logic networks and relational logistic regression as two fundamental representation formalisms in statistical relational artificial intelligence that use weighted formulas in their specification. However, Markov logic networks are based on undirected graphs, while relational logistic regression is based on directed acyclic graphs. We show that when scaling the weight parameters with the domain size, the asymptotic behaviour of a relational logistic regression model is transparently controlled by the parameters, and we supply an algorithm to compute asymptotic probabilities. We also show using two examples that this is not true for Markov logic networks. We also discuss using several examples, mainly from the literature, how the application context can help the user to decide when such scaling is appropriate and when using the raw unscaled parameters might be preferable. We highlight random sampling as a particularly promising area of application for scaled models and expound possible avenues for further research.
    No Weighted-Regret Learning in Adversarial Bandits with Delays. (arXiv:2103.04550v2 [cs.LG] UPDATED)
    Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
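    To make the delayed-feedback setting concrete, here is a minimal sketch of EXP3 in which the importance-weighted loss estimate formed at round $t$ is only applied at round $t + d_t$. The step size and the random loss generator are illustrative assumptions, not the paper's tuned constants or its doubling trick:

    ```python
    # EXP3 with delayed bandit feedback: losses pulled at round t are
    # queued and only incorporated into the weights d_t rounds later.
    import numpy as np

    def exp3_delayed(loss_fn, delays, K, T, eta=0.05, seed=0):
        rng = np.random.default_rng(seed)
        log_w = np.zeros(K)                # log-weights of the K arms
        pending = {}                       # arrival round -> [(arm, loss estimate)]
        for t in range(T):
            p = np.exp(log_w - log_w.max())
            p /= p.sum()
            arm = rng.choice(K, p=p)
            # Importance-weighted loss estimate formed now, applied at t + d_t.
            pending.setdefault(t + delays[t], []).append(
                (arm, loss_fn(t, arm) / p[arm]))
            for a, lhat in pending.pop(t, []):   # feedback arriving this round
                log_w[a] -= eta * lhat
        return log_w

    # Example: 5 arms, adversary replaced by random losses, delays d_t = O(log t).
    T = 1000
    delays = [int(np.log(t + 2)) for t in range(T)]
    print(exp3_delayed(lambda t, a: np.random.rand(), delays, K=5, T=T))
    ```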
    secml: A Python Library for Secure and Explainable Machine Learning. (arXiv:1912.10013v2 [cs.LG] UPDATED)
    We present \texttt{secml}, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, \texttt{secml} provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. \texttt{secml} also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at \url{https://github.com/pralab/secml}.
    On the validity of pre-trained transformers for natural language processing in the software engineering domain. (arXiv:2109.04738v2 [cs.SE] UPDATED)
    Transformers are the current state-of-the-art of natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding regarding the validity of transformers within the software engineering domain, i.e., how good such models are at understanding words and sentences within a software engineering context and how this improves the state-of-the-art. Within this article, we shed light on this complex, but crucial issue. We compare BERT transformer models trained with software engineering data with transformers based on general domain data in multiple dimensions: their vocabulary, their ability to understand which words are missing, and their performance in classification tasks. Our results show that for tasks that require understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, also within the software engineering domain.
    Research on the correlation between text emotion mining and stock market based on deep learning. (arXiv:2205.06675v1 [q-fin.ST])
    This paper discusses how to crawl data from financial forums such as stock bars and conduct sentiment analysis with a deep learning model. The paper uses the BERT model to train on the financial corpus and predict the Shenzhen Stock Index. Through a comparative study of the maximal information coefficient (MIC), it is found that the sentiment features obtained by applying the BERT model to the financial corpus are reflected in the fluctuations of the stock market, which helps to effectively improve prediction accuracy. At the same time, this paper combines deep learning with financial texts to further explore the impact mechanism of investor sentiment on the stock market, which will help national regulatory authorities and policy departments formulate more reasonable policies and guidelines for maintaining the stability of the stock market.
    Artificial Intelligence-Assisted Optimization and Multiphase Analysis of Polygon PEM Fuel Cells. (arXiv:2205.06768v1 [cs.NE])
    This article presents new PEM fuel cell models with hexagonal and pentagonal designs. After observing cell performance improvements in these models, we optimized them. Inlet pressure and temperature were used as input parameters, and consumption and output power were the target parameters of the multi-objective optimization algorithm. We then used artificial intelligence techniques, including deep neural networks and polynomial regression, to model the data. Next, we employed the Response Surface Method (RSM) to derive the target functions. Furthermore, we applied the NSGA-II multi-objective genetic algorithm to optimize the targets. Compared to the base (cubic) model, the optimized pentagonal and hexagonal models increase the output current density by an average of 21.819% and 39.931%, respectively.
    Principal-Agent Hypothesis Testing. (arXiv:2205.06812v1 [cs.GT])
    Consider the relationship between the FDA (the principal) and a pharmaceutical company (the agent). The pharmaceutical company wishes to sell a product to make a profit, and the FDA wishes to ensure that only efficacious drugs are released to the public. The efficacy of the drug is not known to the FDA, so the pharmaceutical company must run a costly trial to prove efficacy to the FDA. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested pharmaceutical company; a lower standard of statistical evidence incentivizes the pharmaceutical company to run more trials for drugs that are less likely to be effective, since the drug may pass the trial by chance, resulting in large profits. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial to understanding this system and designing protocols with high social utility. In this work, we discuss how the principal and agent can enter into a contract with payoffs based on statistical evidence. When there is stronger evidence for the quality of the product, the principal allows the agent to make a larger profit. We show how to design contracts that are robust to an agent's strategic actions, and derive the optimal contract in the presence of strategic behavior.
    Kronecker Decomposition for Knowledge Graph Embeddings. (arXiv:2205.06560v1 [cs.LG])
    Knowledge graph embedding research has mainly focused on learning continuous representations of entities and relations tailored towards the link prediction problem. Recent results indicate an ever increasing predictive ability of current approaches on benchmark datasets. However, this effectiveness often comes with the cost of over-parameterization and increased computational complexity. The former induces extensive hyperparameter optimization to mitigate malicious overfitting. The latter magnifies the importance of winning the hardware lottery. Here, we investigate a remedy for the first problem. We propose a technique based on Kronecker decomposition to reduce the number of parameters in a knowledge graph embedding model, while retaining its expressiveness. Through Kronecker decomposition, large embedding matrices are split into smaller embedding matrices during the training process. Hence, embeddings of knowledge graphs are not plainly retrieved but reconstructed on the fly. The decomposition ensures that elementwise interactions between three embedding vectors are extended with interactions within each embedding vector. This implicitly reduces redundancy in embedding vectors and encourages feature reuse. To quantify the impact of applying Kronecker decomposition on embedding matrices, we conduct a series of experiments on benchmark datasets. Our experiments suggest that applying Kronecker decomposition on embedding matrices leads to an improved parameter efficiency on all benchmark datasets. Moreover, empirical evidence suggests that reconstructed embeddings entail robustness against noise in the input knowledge graph. To foster reproducible research, we provide an open-source implementation of our approach, including training and evaluation scripts as well as pre-trained models in our knowledge graph embedding framework (https://github.com/dice-group/dice-embeddings).
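    As a back-of-the-envelope illustration of the parameter savings (the shapes below are arbitrary, not the paper's configurations), a Kronecker-factored embedding matrix can be reconstructed on the fly from two much smaller factors:

    ```python
    # Reconstructing a large embedding matrix as a Kronecker product of two
    # small ones: 160,000 entries from only 1,000 stored parameters.
    import numpy as np

    n1, n2, d1, d2 = 100, 50, 8, 4       # full matrix: (100*50) x (8*4)
    A = np.random.randn(n1, d1)          # n1*d1 = 800 parameters
    B = np.random.randn(n2, d2)          # n2*d2 = 200 parameters
    E = np.kron(A, B)                    # (5000, 32) full embedding matrix

    # A single entity's embedding can be reconstructed without forming E:
    i = 1234
    row = np.kron(A[i // n2], B[i % n2])
    assert np.allclose(row, E[i])
    ```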
    Graph Attention Networks for Channel Estimation in RIS-assisted Satellite IoT Communications. (arXiv:2104.00735v3 [cs.NI] UPDATED)
    Direct-to-satellite (DtS) communication has gained importance recently to support globally connected Internet of things (IoT) networks. However, the relatively long distances of densely deployed satellite networks around the Earth cause a high path loss. In addition, since high complexity operations such as beamforming, tracking and equalization have to be performed partially in IoT devices, both the hardware complexity and the need for high-capacity batteries of IoT devices increase. Reconfigurable intelligent surfaces (RISs) have the potential to increase energy efficiency and to perform complex signal processing over the transmission environment instead of in IoT devices. However, RISs need information about the cascaded channel in order to change the phase of the incident signal. This study treats the pilot signal as a graph and incorporates this information into graph attention networks (GATs) to track the phase relation through pilot signaling. The proposed GAT-based channel estimation method examines the performance of DtS IoT networks for different RIS configurations to solve the challenging channel estimation problem. It is shown that the proposed GAT both demonstrates a higher performance with increased robustness under changing conditions and has lower computational complexity compared to conventional deep learning methods. Moreover, bit error rate performance is investigated for RIS designs with discrete and non-uniform phase shifts under channel estimation based on the proposed method. One of the findings in this study is that the channel models of the operating environment and the performance of the channel estimation method must be considered during RIS design to exploit performance improvement as far as possible.
    On the Importance of Architecture and Feature Selection in Differentially Private Machine Learning. (arXiv:2205.06720v1 [cs.CR])
    We study a pitfall in the typical workflow for differentially private machine learning. The use of differentially private learning algorithms in a "drop-in" fashion -- without accounting for the impact of differential privacy (DP) noise when choosing what feature engineering operations to use, what features to select, or what neural network architecture to use -- yields overly complex and poorly performing models. In other words, by anticipating the impact of DP noise, a simpler and more accurate alternative model could have been trained for the same privacy guarantee. We systematically study this phenomenon through theory and experiments. On the theory front, we provide an explanatory framework and prove that the phenomenon arises naturally from the addition of noise to satisfy differential privacy. On the experimental front, we demonstrate how the phenomenon manifests in practice using various datasets, types of models, tasks, and neural network architectures. We also analyze the factors that contribute to the problem and distill our experimental insights into concrete takeaways that practitioners can follow when training models with differential privacy. Finally, we propose privacy-aware algorithms for feature selection and neural network architecture search. We analyze their differential privacy properties and evaluate them empirically.
    Explaining by Removing: A Unified Framework for Model Explanation. (arXiv:2011.14878v2 [cs.LG] UPDATED)
    Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
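    As a concrete instance of the removal principle, the toy sketch below scores each feature by the prediction change when it is replaced by its training mean; mean imputation is just one of the removal strategies the framework covers, and the model and data here are stand-ins.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 5))
        y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
        model = LogisticRegression().fit(X, y)

        baseline = X.mean(axis=0)              # "removed" features take their mean
        x = X[0]
        full = model.predict_proba(x[None])[0, 1]
        for j in range(X.shape[1]):
            x_rm = x.copy()
            x_rm[j] = baseline[j]              # remove feature j
            print(f"feature {j}: influence {full - model.predict_proba(x_rm[None])[0, 1]:+.3f}")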
    Variational Hyper-Encoding Networks. (arXiv:2005.08482v2 [stat.ML] UPDATED)
    We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters \theta are drawn from a distribution p(\theta) which is modeled by a hyper-level VAE. We propose a variational inference scheme using Gaussian mixture models to implicitly encode the parameters \theta into a low-dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution q(\theta). HyperVAE can encode the parameters \theta in full, in contrast to common hyper-network practice, which generates only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE on density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
    Distributed Transmission Control for Wireless Networks using Multi-Agent Reinforcement Learning. (arXiv:2205.06800v1 [cs.LG])
    We examine the problem of transmission control, i.e., when to transmit, in distributed wireless communications networks through the lens of multi-agent reinforcement learning. Most other works using reinforcement learning to control or schedule transmissions use some centralized control mechanism, whereas our approach is fully distributed. Each transmitter node is an independent reinforcement learning agent and does not have direct knowledge of the actions taken by other agents. We consider the case where only a subset of agents can successfully transmit at a time, so each agent must learn to act cooperatively with other agents. An agent may decide to transmit a certain number of steps into the future, but this decision is not communicated to the other agents, so it is the task of the individual agents to attempt to transmit at appropriate times. We achieve this collaborative behavior by studying the effects of different action spaces. We are agnostic to the physical layer, which makes our approach applicable to many types of networks. We submit that approaches similar to ours may be useful in other domains that use multi-agent reinforcement learning with independent agents.
    Goal-Guided Neural Cellular Automata: Learning to Control Self-Organising Systems. (arXiv:2205.06806v1 [cs.NE])
    Inspired by cellular growth and self-organization, Neural Cellular Automata (NCAs) have been capable of "growing" artificial cells into images, 3D structures, and even functional machines. NCAs are flexible and robust computational systems but -- similarly to many other self-organizing systems -- inherently uncontrollable during and after their growth process. We present an approach to control these types of systems called Goal-Guided Neural Cellular Automata (GoalNCA), which leverages goal encodings to control cell behavior dynamically at every step of cellular growth. This approach enables the NCA to continually change behavior, and in some cases, generalize its behavior to unseen scenarios. We also demonstrate the robustness of the NCA with its ability to preserve task performance, even when only a portion of cells receive goal information.
    Provably Safe Reinforcement Learning: A Theoretical and Experimental Comparison. (arXiv:2205.06750v1 [cs.LG])
    Ensuring safety of reinforcement learning (RL) algorithms is crucial for many real-world tasks. However, vanilla RL does not guarantee safety for an agent. In recent years, several methods have been proposed to provide safety guarantees for RL. To the best of our knowledge, there is no comprehensive comparison of these provably safe RL methods. We therefore introduce a categorization for existing provably safe RL methods, and present the theoretical foundations for both continuous and discrete action spaces. Additionally, we evaluate provably safe RL on an inverted pendulum. In the experiments, it is shown that indeed only provably safe RL methods guarantee safety.
    Precise Change Point Detection using Spectral Drift Detection. (arXiv:2205.06507v1 [cs.LG])
    The notion of concept drift refers to the phenomenon that the data-generating distribution changes over time; as a consequence, machine learning models may become inaccurate and need adjustment. In this paper we consider the problem of detecting those change points in unsupervised learning. Many unsupervised approaches rely on the discrepancy between the sample distributions of two time windows. This procedure is noisy for small windows, hence prone to induce false positives, and unable to deal with more than one drift event in a window. In this paper we instead rely on structural properties of drift-induced signals, derived from spectral properties of the kernel embedding of distributions. Based thereon, we derive a new unsupervised drift detection algorithm, investigate its mathematical properties, and demonstrate its usefulness in several experiments.
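    For orientation, the sketch below implements the window-discrepancy baseline that the abstract argues is noisy (a kernel MMD between two sliding windows); the proposed method replaces this statistic with spectral properties of the kernel embedding, which is not reproduced here.

        import numpy as np

        def rbf(a, b, gamma=1.0):
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)

        def mmd2(x, y):
            # biased estimate of squared Maximum Mean Discrepancy
            return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()

        rng = np.random.default_rng(0)
        stream = np.concatenate([rng.normal(0, 1, (300, 2)),
                                 rng.normal(2, 1, (300, 2))])  # drift at t=300
        w = 50
        scores = [mmd2(stream[t - w:t], stream[t:t + w])
                  for t in range(w, len(stream) - w)]
        print("candidate change point:", w + int(np.argmax(scores)))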
    Transformation-Interaction-Rational Representation for Symbolic Regression. (arXiv:2205.06807v1 [cs.NE])
    Symbolic Regression searches for a function form that approximates a dataset, often using Genetic Programming. Since there is usually no restriction on what form the function can have, Genetic Programming may return a hard-to-understand model due to non-linear function chaining or long expressions. A novel representation called Interaction-Transformation was recently proposed to alleviate this problem. In this representation, the function form is restricted to an affine combination of terms generated as the application of a single univariate function to the interaction of selected variables. This representation obtained competitive solutions on standard benchmarks. Despite the initial success, a broader set of benchmarking functions revealed the limitations of the constrained representation. In this paper we propose an extension to this representation, called the Transformation-Interaction-Rational representation, which defines a new function form as the ratio of two Interaction-Transformation functions. Additionally, the target variable can also be transformed with a univariate function. The main goal is to improve the approximation power while still constraining the overall complexity of the expression. We tested this representation with standard Genetic Programming with crossover and mutation. The results show a great improvement when compared to its predecessor, and state-of-the-art performance on a large benchmark.
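    A sketch of evaluating one candidate in this representation (the exact denominator convention below is an assumption): each Interaction-Transformation term applies a univariate function to a product of variables raised to integer exponents, and the model is a ratio of two affine combinations of such terms.

        import numpy as np

        def it_term(X, func, exponents):
            # univariate function applied to an interaction of variables
            return func(np.prod(X ** np.asarray(exponents), axis=1))

        def tir(X, num_terms, den_terms, w_num, w_den, bias):
            num = bias + sum(w * it_term(X, f, e) for w, (f, e) in zip(w_num, num_terms))
            den = 1.0 + sum(w * it_term(X, f, e) for w, (f, e) in zip(w_den, den_terms))
            return num / den

        X = np.random.default_rng(0).uniform(1, 2, size=(5, 2))
        yhat = tir(X,
                   num_terms=[(np.sin, [1, 0]), (np.log, [1, 1])],
                   den_terms=[(np.sqrt, [0, 2])],
                   w_num=[1.5, -0.7], w_den=[0.3], bias=0.1)
        print(yhat)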
    Interlock-Free Multi-Aspect Rationalization for Text Classification. (arXiv:2205.06756v1 [cs.CL])
    Explanation is important for text classification tasks. One prevalent type of explanation is rationales, which are text snippets of the input text that suffice to yield the prediction and are meaningful to humans. Much research on rationalization has been based on the selective rationalization framework, which has recently been shown to be problematic due to interlocking dynamics. In this paper, we address the interlocking problem in the multi-aspect setting, where we aim to generate multiple rationales for multiple outputs. More specifically, we propose a multi-stage training method incorporating an additional self-supervised contrastive loss that helps to generate more semantically diverse rationales. Empirical results on the beer review dataset show that our method significantly improves rationalization performance.
    Embodied-Symbolic Contrastive Graph Self-Supervised Learning for Molecular Graphs. (arXiv:2205.06783v1 [cs.LG])
    Dual embodied-symbolic concept representations are the foundation for deep learning and symbolic AI integration. We discuss the use of dual embodied-symbolic concept representations for molecular graph representation learning, specifically with exemplar-based contrastive self-supervised learning (SSL). The embodied representations are learned from molecular graphs, and the symbolic representations are learned from the corresponding Chemical knowledge graph (KG). We use the Chemical KG to enhance molecular graphs with symbolic (semantic) knowledge and generate their augmented molecular graphs. We treat a molecular graph and its semantically augmented molecular graph as exemplars of the same semantic class, and use the pairs as positive pairs in exemplar-based contrastive SSL.
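    A compact PyTorch sketch of the contrastive step, assuming encoder outputs are already computed: row i of z1 embeds a molecular graph, row i of z2 its semantically augmented version, and a standard NT-Xent loss treats them as a positive pair.

        import torch
        import torch.nn.functional as F

        def nt_xent(z1, z2, tau=0.5):
            z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N x d
            sim = z @ z.t() / tau
            sim.fill_diagonal_(float("-inf"))             # drop self-similarity
            n = z1.size(0)
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
            return F.cross_entropy(sim, targets)

        z1 = torch.randn(8, 32)                 # stand-ins for graph embeddings
        z2 = z1 + 0.1 * torch.randn(8, 32)      # "augmented" views
        print(nt_xent(z1, z2).item())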
    Self-Sampling for Neural Point Cloud Consolidation. (arXiv:2008.06471v3 [cs.GR] UPDATED)
    We introduce a novel technique for neural point cloud consolidation which learns from only the input point cloud. Unlike other point upsampling methods which analyze shapes via local patches, in this work, we learn from global subsets. We repeatedly self-sample the input point cloud with global subsets that are used to train a deep neural network. Specifically, we define source and target subsets according to the desired consolidation criteria (e.g., generating sharp points or points in sparse regions). The network learns a mapping from source to target subsets, and implicitly learns to consolidate the point cloud. During inference, the network is fed with random subsets of points from the input, which it displaces to synthesize a consolidated point set. We leverage the inductive bias of neural networks to eliminate noise and outliers, a notoriously difficult problem in point cloud consolidation. The shared weights of the network are optimized over the entire shape, learning non-local statistics and exploiting the recurrence of local-scale geometries. Specifically, the network encodes the distribution of the underlying shape surface within a fixed set of local kernels, which results in the best explanation of the underlying shape surface. We demonstrate the ability to consolidate point sets from a variety of shapes, while eliminating outliers and noise.
    Exploring the structure-property relations of thin-walled, 2D extruded lattices using neural networks. (arXiv:2205.06761v1 [cs.LG])
    This paper investigates the structure-property relations of thin-walled lattices under dynamic longitudinal compression, characterized by their cross-sections and heights. These relations elucidate the interactions of different geometric features of a design on mechanical response, including energy absorption. We propose a combinatorial, key-based design system to generate different lattice designs and use the finite element method to simulate their response with the Johnson-Cook material model. Using an autoencoder, we encode the cross-sectional images of the lattices into latent design feature vectors, which are supplied to the neural network model to generate predictions. The trained models can accurately predict lattice energy absorption curves in the key-based design system and can be extended to new designs outside of the system via transfer learning.
    Emergent Bartering Behaviour in Multi-Agent Reinforcement Learning. (arXiv:2205.06760v1 [cs.AI])
    Advances in artificial intelligence often stem from the development of new environments that abstract real-world situations into a form where research can be done conveniently. This paper contributes such an environment based on ideas inspired by elementary Microeconomics. Agents learn to produce resources in a spatially complex world, trade them with one another, and consume those that they prefer. We show that the emergent production, consumption, and pricing behaviors respond to environmental conditions in the directions predicted by supply and demand shifts in Microeconomics. We also demonstrate settings where the agents' emergent prices for goods vary over space, reflecting the local abundance of goods. After the price disparities emerge, some agents then discover a niche of transporting goods between regions with different prevailing prices -- a profitable strategy because they can buy goods where they are cheap and sell them where they are expensive. Finally, in a series of ablation experiments, we investigate how choices in the environmental rewards, bartering actions, agent architecture, and ability to consume tradable goods can either aid or inhibit the emergence of this economic behavior. This work is part of the environment development branch of a research program that aims to build human-like artificial general intelligence through multi-agent interactions in simulated societies. By exploring which environment features are needed for the basic phenomena of elementary microeconomics to emerge automatically from learning, we arrive at an environment that differs from those studied in prior multi-agent reinforcement learning work along several dimensions. For example, the model incorporates heterogeneous tastes and physical abilities, and agents negotiate with one another as a grounded form of communication.
    Imaging Conductivity from Current Density Magnitude using Neural Networks. (arXiv:2204.02441v3 [math.NA] UPDATED)
    Conductivity imaging represents one of the most important tasks in medical imaging. In this work we develop a neural network based reconstruction technique for imaging the conductivity from the magnitude of the internal current density. It is achieved by formulating the problem as a relaxed weighted least-gradient problem, and then approximating its minimizer by standard fully connected feedforward neural networks. We derive bounds on two components of the generalization error, i.e., approximation error and statistical error, explicitly in terms of properties of the neural networks (e.g., depth, total number of parameters, and the bound of the network parameters). We illustrate the performance and distinct features of the approach on several numerical experiments. Numerically, it is observed that the approach enjoys remarkable robustness with respect to the presence of data noise.
    Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes. (arXiv:2205.06621v1 [cs.CL])
    To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than tweets from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regard to gender, but not race.
    Learning Keypoints from Synthetic Data for Robotic Cloth Folding. (arXiv:2205.06714v1 [cs.RO])
    Robotic cloth manipulation is challenging due to its deformability, which makes determining its full state infeasible. However, for cloth folding, it suffices to know the position of a few semantic keypoints. Convolutional neural networks (CNN) can be used to detect these keypoints, but require large amounts of annotated data, which is expensive to collect. To overcome this, we propose to learn these keypoint detectors purely from synthetic data, enabling low-cost data collection. In this paper, we procedurally generate images of towels and use them to train a CNN. We evaluate the performance of this detector for folding towels on a unimanual robot setup and find that the grasp and fold success rates are 77% and 53%, respectively. We conclude that learning keypoint detectors from synthetic data for cloth folding and related tasks is a promising research direction, discuss some failures and relate them to future work. A video of the system, as well as the codebase, more details on the CNN architecture and the training setup can be found at https://github.com/tlpss/workshop-icra-2022-cloth-keypoints.git.
    Federated Learning Under Intermittent Client Availability and Time-Varying Communication Constraints. (arXiv:2205.06730v1 [cs.LG])
    Federated learning systems facilitate training of global models in settings where potentially heterogeneous data is distributed across a large number of clients. Such systems operate in settings with intermittent client availability and/or time-varying communication constraints. As a result, the global models trained by federated learning systems may be biased towards clients with higher availability. We propose F3AST, an unbiased algorithm that dynamically learns an availability-dependent client selection strategy which asymptotically minimizes the impact of client-sampling variance on the global model convergence, enhancing performance of federated learning. The proposed algorithm is tested in a variety of settings for intermittently available clients under communication constraints, and its efficacy demonstrated on synthetic data and realistically federated benchmarking experiments using CIFAR100 and Shakespeare datasets. We show up to 186% and 8% accuracy improvements over FedAvg, and 8% and 7% over FedAdam on CIFAR100 and Shakespeare, respectively.
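    A hedged sketch of the underlying idea, not the F3AST algorithm itself: the server tracks an empirical availability estimate per client and biases selection toward rarely available clients so that their updates are not underrepresented in the global model. All quantities below are assumptions for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        n_clients, rounds, k = 20, 100, 5
        true_avail = rng.uniform(0.1, 0.9, n_clients)   # unknown to the server
        est = np.full(n_clients, 0.5)                   # running availability estimate
        counts = np.zeros(n_clients, dtype=int)         # how often each client is picked

        for t in range(rounds):
            up = rng.random(n_clients) < true_avail     # who is reachable this round
            est = 0.9 * est + 0.1 * up
            idx = np.flatnonzero(up)
            if idx.size:
                w = 1.0 / np.maximum(est[idx], 1e-3)    # favor rarely available clients
                chosen = rng.choice(idx, size=min(k, idx.size),
                                    replace=False, p=w / w.sum())
                counts[chosen] += 1
        print("selection counts:", counts)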
    EyeDAS: Securing Perception of Autonomous Cars Against the Stereoblindness Syndrome. (arXiv:2205.06765v1 [cs.LG])
    The ability to detect whether an object is a 2D or 3D object is extremely important in autonomous driving, since a detection error can have life-threatening consequences, endangering the safety of the driver, passengers, pedestrians, and others on the road. Methods proposed to distinguish between 2D and 3D objects (e.g., liveness detection methods) are not suitable for autonomous driving, because they are object dependent or do not consider the constraints associated with autonomous driving (e.g., the need for real-time decision-making while the vehicle is moving). In this paper, we present EyeDAS, a novel few-shot learning-based method aimed at securing an object detector (OD) against the threat posed by the stereoblindness syndrome (i.e., the inability to distinguish between 2D and 3D objects). We evaluate EyeDAS's real-time performance using 2,000 objects extracted from seven YouTube video recordings of street views taken by a dash cam from the driver's seat perspective. When applying EyeDAS to seven state-of-the-art ODs as a countermeasure, EyeDAS was able to reduce the 2D misclassification rate from 71.42-100% to 2.4% with a 3D misclassification rate of 0% (TPR of 1.0). We also show that EyeDAS outperforms the baseline method and achieves an AUC of over 0.999 and a TPR of 1.0 with an FPR of 0.024.
    Univariate and Multivariate LSTM Model for Short-Term Stock Market Prediction. (arXiv:2205.06673v1 [q-fin.ST])
    Designing robust and accurate prediction models has been a viable research area for a long time. While proponents of well-functioning markets believe that it is difficult to accurately predict market prices, many scholars disagree. Robust and accurate prediction systems are helpful not only to businesses but also to individuals in making their financial investments. This paper presents an LSTM model with two different input approaches for predicting the short-term stock prices of two Indian companies, Reliance Industries and Infosys Ltd. Ten years of historical data (2012-2021) are taken from the Yahoo Finance website to carry out the analysis of the proposed approaches. In the first approach, the closing prices of the two selected companies are directly applied to a univariate LSTM model. In the second approach, technical indicator values are calculated from the closing prices and then collectively applied to a multivariate LSTM model. Short-term market behaviour for upcoming days is evaluated. Experimental outcomes reveal that the first approach is useful for determining the future trend, while the multivariate LSTM model with technical indicators is found to be more accurate in predicting future price behaviour.
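    The two input settings can be sketched with the same PyTorch module (the layer sizes and the four indicator channels are assumptions for illustration): the univariate variant sees a window of closing prices only, while the multivariate one stacks technical indicators as extra channels.

        import torch
        import torch.nn as nn

        class PriceLSTM(nn.Module):
            def __init__(self, n_features):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden_size=50, batch_first=True)
                self.head = nn.Linear(50, 1)      # next-day closing price

            def forward(self, x):                 # x: (batch, window, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1])      # prediction from last time step

        uni = PriceLSTM(n_features=1)             # closing price only
        multi = PriceLSTM(n_features=5)           # close + 4 technical indicators
        print(multi(torch.randn(16, 30, 5)).shape)   # torch.Size([16, 1])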
    Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets. (arXiv:2205.06595v1 [stat.ML])
    Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
    A Vision Inspired Neural Network for Unsupervised Anomaly Detection in Unordered Data. (arXiv:2205.06716v1 [cs.LG])
    A fundamental problem in the field of unsupervised machine learning is the detection of anomalies corresponding to rare and unusual observations of interest; reasons include their rejection, accommodation or further investigation. Anomalies are intuitively understood to be something unusual or inconsistent, whose occurrence sparks immediate attention. More formally, anomalies are those observations (under appropriate random variable modelling) whose expectation of occurrence with respect to a grouping of prior interest is less than one; such a definition and understanding has been used to develop the parameter-free perception anomaly detection algorithm. The present work seeks to establish important and practical connections between the approach used by the perception algorithm and prior decades of research in neurophysiology and computational neuroscience, particularly that of information processing in the retina and visual cortex. The algorithm is conceptualised as a neuron model which forms the kernel of an unsupervised neural network that learns to signal unexpected observations as anomalies. Both the network and neuron display properties observed in biological processes including: immediate intelligence; parallel processing; redundancy; global degradation; contrast invariance; parameter-free computation; dynamic thresholds and non-linear processing. A robust and accurate model for anomaly detection in univariate and multivariate data is built using this network as a concrete application.
    Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction. (arXiv:2205.06672v1 [cs.CV])
    Classical multiple instance learning (MIL) methods are often based on the assumption that instances are independent and identically distributed, hence neglecting the potentially rich contextual information beyond individual entities. On the other hand, Transformers with global self-attention modules have been proposed to model the interdependencies among all instances. However, in this paper we question: is global relation modeling using self-attention necessary, or can we appropriately restrict self-attention calculations to local regimes in large-scale whole slide images (WSIs)? We propose a general-purpose local attention graph-based Transformer for MIL (LA-MIL), introducing an inductive bias by explicitly contextualizing instances in adaptive local regimes of arbitrary size. Additionally, an efficiently adapted loss function enables our approach to learn expressive WSI embeddings for the joint analysis of multiple biomarkers. We demonstrate that LA-MIL achieves state-of-the-art results in mutation prediction for gastrointestinal cancer, outperforming existing models on important biomarkers such as microsatellite instability for colorectal cancer. This suggests that local self-attention sufficiently models dependencies on par with global modules. Our implementation will be published.
    Multiple Domain Causal Networks. (arXiv:2205.06791v1 [stat.ML])
    Observational studies are regarded as economic alternatives to randomized trials, often used in their stead to investigate and determine treatment efficacy. Due to lack of sample size, observational studies commonly combine data from multiple sources or different sites/centers. Despite the benefits of an increased sample size, a naive combination of multicenter data may result in incongruities stemming from center-specific protocols for generating cohorts or reactions towards treatments distinct to a given center, among other things. These issues arise in a variety of other contexts, including capturing a treatment effect related to an individual's unique biological characteristics. Existing methods for estimating heterogeneous treatment effects have not adequately addressed the multicenter context, but rather treat it simply as a means to obtain sufficient sample size. Additionally, previous approaches to estimating treatment effects do not straightforwardly generalize to the multicenter design, especially when required to provide treatment insights for patients from a new, unobserved center. To address these shortcomings, we propose Multiple Domain Causal Networks (MDCN), an approach that simultaneously strengthens the information sharing between similar centers while addressing the selection bias in treatment assignment through learning of a new feature embedding. In empirical evaluations, MDCN is consistently more accurate when estimating the heterogeneous treatment effect in new centers compared to benchmarks that adjust solely based on treatment imbalance or general center differences. Finally, we justify our approach by providing theoretical analyses that demonstrate that MDCN improves on the generalization bound of the new, unobserved target center.
    The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation. (arXiv:2205.06618v1 [cs.CL])
    Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.
    Accelerometry-based classification of circulatory states during out-of-hospital cardiac arrest. (arXiv:2205.06540v1 [eess.SP])
    Objective: During cardiac arrest treatment, a reliable detection of spontaneous circulation, usually performed by manual pulse checks, is both vital for patient survival and practically challenging. Methods: We developed a machine learning algorithm to automatically predict the circulatory state during cardiac arrest treatment from 4-second-long snippets of accelerometry and electrocardiogram data from real-world defibrillator records. The algorithm was trained based on 917 cases from the German Resuscitation Registry, for which ground truth labels were created by a manual annotation of physicians. It uses a kernelized Support Vector Machine classifier based on 14 features, which partially reflect the correlation between accelerometry and electrocardiogram data. Results: On a test data set, the proposed algorithm exhibits an accuracy of 94.4 (93.6, 95.2)%, a sensitivity of 95.0 (93.9, 96.1)%, and a specificity of 93.9 (92.7, 95.1)%. Conclusion and significance: In application, the algorithm may be used to simplify retrospective annotation for quality management and, moreover, to support clinicians to assess circulatory state during cardiac arrest treatment.
    Heavy-Tail Phenomenon in Decentralized SGD. (arXiv:2205.06689v1 [stat.ML])
    Recent theoretical studies have shown that heavy tails can emerge in stochastic optimization due to `multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD, where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (step-size and network size) where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
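    The paper derives tail indices analytically; as a purely empirical companion, the Hill estimator below is the standard way to gauge how heavy a tail is from samples such as iterates (a smaller estimate means a heavier tail). The cutoff k is an assumption.

        import numpy as np

        def hill(samples, k=100):
            x = np.sort(np.abs(samples))[::-1]    # descending order statistics
            return 1.0 / (np.log(x[:k]) - np.log(x[k])).mean()

        rng = np.random.default_rng(0)
        heavy = rng.standard_t(df=3, size=100_000)    # true tail index around 3
        light = rng.normal(size=100_000)              # effectively no power-law tail
        print("t(3):", round(hill(heavy), 2), " gaussian:", round(hill(light), 2))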
    Improving Astronomical Time-series Classification via Data Augmentation with Generative Adversarial Networks. (arXiv:2205.06758v1 [astro-ph.IM])
    Due to the latest advances in technology, telescopes with significant sky coverage will produce millions of astronomical alerts per night that must be classified both rapidly and automatically. Currently, classification consists of supervised machine learning algorithms whose performance is limited by the number of existing annotations of astronomical objects and their highly imbalanced class distributions. In this work, we propose a data augmentation methodology based on Generative Adversarial Networks (GANs) to generate a variety of synthetic light curves from variable stars. Our novel contributions, consisting of a resampling technique and an evaluation metric, can assess the quality of generative models in unbalanced datasets and identify GAN-overfitting cases that the Fr\'echet Inception Distance does not reveal. We applied our proposed model to two datasets taken from the Catalina and Zwicky Transient Facility surveys. The classification accuracy of variable stars is improved significantly when training with synthetic data and testing with real data with respect to the case of using only real data.
    On the Existence of Simpler Machine Learning Models. (arXiv:1908.01755v4 [cs.LG] UPDATED)
    It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.
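    A crude Monte-Carlo illustration of the ratio (the function class, sampling scheme, and epsilon are all assumptions, not the paper's protocol): sample many linear classifiers and measure what fraction comes within epsilon of the best sampled accuracy.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 3))
        y = (X @ np.array([1.0, -0.5, 0.0]) > 0).astype(int)

        W = rng.uniform(-1, 1, size=(20_000, 3))      # sampled hypotheses
        acc = ((X @ W.T > 0).astype(int) == y[:, None]).mean(axis=0)
        eps = 0.05
        ratio = (acc >= acc.max() - eps).mean()       # share of near-optimal models
        print(f"best acc {acc.max():.3f}, Rashomon ratio ~ {ratio:.4f}")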
    Tensor Decompositions for Hyperspectral Data Processing in Remote Sensing: A Comprehensive Review. (arXiv:2205.06407v1 [cs.CV])
    Owing to the rapid development of sensor technology, hyperspectral (HS) remote sensing (RS) imaging has provided a significant amount of spatial and spectral information for the observation and analysis of the Earth's surface from distant data acquisition devices, such as aircraft, spacecraft, and satellites. The recent advancement and even revolution of HS RS techniques offer opportunities to realize the full potential of various applications, while confronting new challenges for efficiently processing and analyzing the enormous HS acquisition data. Because it preserves the inherent 3-D structure of HS data, tensor decomposition has attracted widespread attention and research in HS data processing tasks over the past decades. In this article, we aim at presenting a comprehensive overview of tensor decomposition, specifically contextualizing five broad topics in HS data processing: HS restoration, compressed sensing, anomaly detection, super-resolution, and spectral unmixing. For each topic, we elaborate on the remarkable achievements of tensor decomposition models for HS RS, with a pivotal description of the existing methodologies and a representative exhibition of experimental results. As a result, the remaining challenges and follow-up research directions are outlined and discussed from the perspective of real HS RS practice and of tensor decomposition merged with advanced priors and even deep neural networks. This article summarizes different tensor decomposition-based HS data processing methods and categorizes them into different classes, from simple adoptions to complex combinations with other priors, for algorithm beginners. We also expect that this survey can provide new investigations and development trends for experienced researchers who understand tensor decomposition and HS RS to some extent.
    FastSTMF: Efficient tropical matrix factorization algorithm for sparse data. (arXiv:2205.06619v1 [cs.LG])
    Matrix factorization, one of the most popular methods in machine learning, has recently benefited from introducing non-linearity into prediction tasks via the tropical semiring. The non-linearity enables a better fit to extreme values and distributions, thus discovering high-variance patterns that differ from those found by standard linear algebra. However, the optimization process of various tropical matrix factorization methods is slow. In our work, we propose a new method, FastSTMF, based on Sparse Tropical Matrix Factorization (STMF), which introduces a novel strategy for updating factor matrices that results in efficient computational performance. We evaluated the efficiency of FastSTMF on synthetic and real gene expression data from the TCGA database, and the results show that FastSTMF outperforms STMF in both accuracy and running time. Compared to NMF, we show that FastSTMF performs better on some datasets and is not as prone to overfitting as NMF. This work sets the basis for developing other matrix factorization techniques based on many other semirings using the newly proposed optimization process.
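    For background, the tropical (max-plus) semiring replaces (+, x) with (max, +), and STMF-style methods approximate a data matrix by the tropical product of two factors. The sketch below computes only that product, not the factorization update itself.

        import numpy as np

        def maxplus(U, V):
            # tropical product: result[i, j] = max_k (U[i, k] + V[k, j])
            return (U[:, :, None] + V[None, :, :]).max(axis=1)

        rng = np.random.default_rng(0)
        U = rng.uniform(0, 1, (4, 2))
        V = rng.uniform(0, 1, (2, 3))
        print(maxplus(U, V))                    # 4 x 3 tropical reconstruction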
    Deep Reinforcement Learning for Computational Fluid Dynamics on HPC Systems. (arXiv:2205.06502v1 [cs.LG])
    Reinforcement learning (RL) is highly suitable for devising control strategies in the context of dynamical systems. A prominent instance of such a dynamical system is the system of equations governing fluid dynamics. Recent research results indicate that RL-augmented computational fluid dynamics (CFD) solvers can exceed the current state of the art, for example in the field of turbulence modeling. However, while in supervised learning the training data can be generated a priori in an offline manner, RL requires constant run-time interaction and data exchange with the CFD solver during training. In order to leverage the potential of RL-enhanced CFD, the interaction between the CFD solver and the RL algorithm thus has to be implemented efficiently on high-performance computing (HPC) hardware. To this end, we present Relexi as a scalable RL framework that bridges the gap between machine learning workflows and modern CFD solvers on HPC systems, providing both components with its specialized hardware. Relexi is built with modularity in mind and allows easy integration of various HPC solvers by means of the in-memory data transfer provided by the SmartSim library. Here, we demonstrate that the Relexi framework can scale up to hundreds of parallel environments on thousands of cores. This allows us to leverage modern HPC resources to either enable larger problems or faster turnaround times. Finally, we demonstrate the potential of an RL-augmented CFD solver by finding a control strategy for optimal eddy viscosity selection in large eddy simulations.
    Two-layer neural networks with values in a Banach space. (arXiv:2105.02095v2 [cs.LG] UPDATED)
    We study two-layer neural networks whose domain and range are Banach spaces with separable preduals. In addition, we assume that the image space is equipped with a partial order, i.e. it is a Riesz space. As the nonlinearity we choose the lattice operation of taking the positive part; in case of $\mathbb R^d$-valued neural networks this corresponds to the ReLU activation function. We prove inverse and direct approximation theorems with Monte-Carlo rates for a certain class of functions, extending existing results for the finite-dimensional case. In the second part of the paper, we study, from the regularisation theory viewpoint, the problem of finding optimal representations of such functions via signed measures on a latent space from a finite number of noisy observations. We discuss regularity conditions known as source conditions and obtain convergence rates in a Bregman distance for the representing measure in the regime when both the noise level goes to zero and the number of samples goes to infinity at appropriate rates.
    Discovering the building blocks of dark matter halo density profiles with neural networks. (arXiv:2203.08827v2 [astro-ph.CO] UPDATED)
    The density profiles of dark matter halos are typically modeled using empirical formulae fitted to the density profiles of relaxed halo populations. We present a neural network model that is trained to learn the mapping from the raw density field containing each halo to the dark matter density profile. We show that the model recovers the widely-used Navarro-Frenk-White (NFW) profile out to the virial radius, and can additionally describe the variability in the outer profile of the halos. The neural network architecture consists of a supervised encoder-decoder framework, which first compresses the density inputs into a low-dimensional latent representation, and then outputs $\rho(r)$ for any desired value of radius $r$. The latent representation contains all the information used by the model to predict the density profiles. This allows us to interpret the latent representation by quantifying the mutual information between the representation and the halos' ground-truth density profiles. A two-dimensional representation is sufficient to accurately model the density profiles up to the virial radius; however, a three-dimensional representation is required to describe the outer profiles beyond the virial radius. The additional dimension in the representation contains information about the infalling material in the outer profiles of dark matter halos, thus discovering the splashback boundary of halos without prior knowledge of the halos' dynamical history.
    Space4HGNN: A Novel, Modularized and Reproducible Platform to Evaluate Heterogeneous Graph Neural Network. (arXiv:2202.09177v2 [cs.LG] UPDATED)
    Heterogeneous Graph Neural Networks (HGNNs) have been successfully employed in various tasks, but we cannot accurately assess the importance of different design dimensions of HGNNs due to the diversity of architectures and applied scenarios. Besides, in the HGNN research community, implementing and evaluating various tasks still requires much human effort. To mitigate these issues, we first propose a unified framework covering most HGNNs, consisting of three components: heterogeneous linear transformation, heterogeneous graph transformation, and a heterogeneous message passing layer. Then we build a platform, Space4HGNN, by defining a design space for HGNNs based on the unified framework, which offers modularized components, reproducible implementations, and standardized evaluation for HGNNs. Finally, we conduct experiments to analyze the effect of different designs. With the insights found, we distill a condensed design space and verify its effectiveness.
    Stability to Deformations in Manifold Neural Networks. (arXiv:2106.03725v2 [cs.LG] UPDATED)
    Stability is an important property of graph neural networks (GNNs) which explains their success in many problems of practical interest. Existing GNN stability results depend on the size of the graph, restricting applicability to graphs of moderate size. To understand the stability properties of GNNs on large graphs, we define manifold convolutions and consider neural networks supported on manifolds. These are defined in terms of manifold diffusions mediated by the Laplace-Beltrami (LB) operator and are interpreted as limits of GNNs running on graphs of growing size. We define manifold deformations and show that they lead to perturbations of the manifold's LB operator that consist of an absolute and a relative perturbation term. We then define two frequency dependent manifold filters that split the infinite dimensional spectrum of the LB operator in finite partitions, and prove that these filters are stable to absolute and relative perturbations of the LB operator respectively. We also observe a trade-off between the stability and the discriminability from the stability bounds. Moreover, manifold neural networks (MNNs) composed of these filters inherit the stability properties while the nonlinear activation function helps to improve the discriminability. Therefore, the MNNs can be both stable and discriminative. We verify our results numerically in shape classification with point cloud datasets.
    Data-Driven Upper Bounds on Channel Capacity. (arXiv:2205.06471v1 [cs.IT])
    We consider the problem of estimating an upper bound on the capacity of a memoryless channel with unknown channel law and continuous output alphabet. A novel data-driven algorithm is proposed that exploits the dual representation of capacity where the maximization over the input distribution is replaced with a minimization over a reference distribution on the channel output. To efficiently compute the required divergence maximization between the conditional channel and the reference distribution, we use a modified mutual information neural estimator that takes the channel input as an additional parameter. We evaluate our approach on different memoryless channels and show that the estimated upper bounds closely converge either to the channel capacity or to best-known lower bounds.
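    The dual bound is easy to state in the simplest discrete setting: for any reference output distribution R, capacity satisfies C <= max_x KL(P(.|x) || R), with equality at the capacity-achieving output law. The paper estimates this from data with a neural divergence estimator; the sketch below merely evaluates the bound exactly for a binary symmetric channel.

        import numpy as np

        def kl(p, q):
            return np.sum(p * np.log2(p / q))   # divergence in bits

        eps = 0.1
        P = np.array([[1 - eps, eps],           # P(y|x=0)
                      [eps, 1 - eps]])          # P(y|x=1)
        R = np.array([0.5, 0.5])                # reference output distribution
        upper = max(kl(P[x], R) for x in range(2))
        H = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
        print("upper bound:", upper, " true BSC capacity:", 1 - H)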
    Detecting Rumours with Latency Guarantees using Massive Streaming Data. (arXiv:2205.06580v1 [cs.SI])
    Today's social networks continuously generate massive streams of data, which provide a valuable starting point for the detection of rumours as soon as they start to propagate. However, rumour detection faces tight latency bounds, which cannot be met by contemporary algorithms, given the sheer volume of high-velocity streaming data emitted by social networks. Hence, in this paper, we argue for best-effort rumour detection that detects most rumours quickly rather than all rumours with a high delay. To this end, we combine techniques for efficient, graph-based matching of rumour patterns with effective load shedding that discards some of the input data while minimising the loss in accuracy. Experiments with large-scale real-world datasets illustrate the robustness of our approach in terms of runtime performance and detection accuracy under diverse streaming conditions.
    DRBM-ClustNet: A Deep Restricted Boltzmann-Kohonen Architecture for Data Clustering. (arXiv:2205.06697v1 [cs.LG])
    A Bayesian Deep Restricted Boltzmann-Kohonen architecture for data clustering, termed DRBM-ClustNet, is proposed. This core clustering engine consists of a Deep Restricted Boltzmann Machine (DRBM) for processing unlabeled data by creating new features that are uncorrelated and have large variance. Next, the number of clusters is predicted using the Bayesian Information Criterion (BIC), followed by a Kohonen-network-based clustering layer. The processing of unlabeled data is done in three stages for efficient clustering of non-linearly separable datasets. In the first stage, the DRBM performs non-linear feature extraction, capturing a highly complex data representation by projecting feature vectors of $d$ dimensions into $n$ dimensions. Most clustering algorithms require the number of clusters to be decided a priori; hence, to automate this choice, the second stage uses BIC. In the third stage, the number of clusters derived from BIC forms the input for the Kohonen network, which performs clustering of the feature-extracted data obtained from the DRBM. This method overcomes the general disadvantages of clustering algorithms, such as prior specification of the number of clusters, convergence to local optima and poor clustering accuracy on non-linear datasets. In this research we use two synthetic datasets, fifteen benchmark datasets from the UCI Machine Learning repository, and four image datasets to analyze DRBM-ClustNet. The proposed framework is evaluated based on clustering accuracy and ranked against other state-of-the-art clustering methods. The obtained results demonstrate that DRBM-ClustNet outperforms state-of-the-art clustering algorithms.
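    A hedged three-stage sketch of the same pipeline shape using off-the-shelf parts (an RBM for features, BIC to pick the cluster count, and KMeans standing in for the Kohonen layer; none of these match the paper's exact components):

        import numpy as np
        from sklearn.datasets import load_digits
        from sklearn.neural_network import BernoulliRBM
        from sklearn.mixture import GaussianMixture
        from sklearn.cluster import KMeans

        X = load_digits().data / 16.0           # scale to [0, 1] for the RBM

        # stage 1: non-linear feature extraction
        feats = BernoulliRBM(n_components=32, n_iter=10, random_state=0).fit_transform(X)

        # stage 2: pick the number of clusters by BIC
        bics = {k: GaussianMixture(k, random_state=0).fit(feats).bic(feats)
                for k in range(2, 15)}
        k_best = min(bics, key=bics.get)

        # stage 3: cluster the extracted features
        labels = KMeans(n_clusters=k_best, n_init=10, random_state=0).fit_predict(feats)
        print("chosen k:", k_best, " cluster sizes:", np.bincount(labels))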
    DualCF: Efficient Model Extraction Attack from Counterfactual Explanations. (arXiv:2205.06504v1 [cs.CR])
    Cloud service providers have launched Machine-Learning-as-a-Service (MLaaS) platforms to allow users to access large-scale cloud-based models via APIs. In addition to prediction outputs, these APIs can also provide other information in a more human-understandable way, such as counterfactual explanations (CFs). However, such extra information inevitably causes the cloud models to be more vulnerable to extraction attacks, which aim to steal the internal functionality of models in the cloud. Due to the black-box nature of cloud models, however, a vast number of queries are inevitably required by existing attack strategies before the substitute model achieves high fidelity. In this paper, we propose a novel, simple yet efficient querying strategy to greatly enhance the querying efficiency of stealing a classification model. This is motivated by our observation that current querying strategies suffer from a decision boundary shift issue induced by taking far-distant queries and close-to-boundary CFs into substitute model training. We then propose the DualCF strategy to circumvent the above issues, which is achieved by taking not only a CF but also the counterfactual explanation of the CF (CCF) as pairs of training samples for the substitute model. Extensive and comprehensive experimental evaluations are conducted on both synthetic and real-world datasets. The experimental results favorably illustrate that DualCF can efficiently and effectively produce a high-fidelity model with fewer queries.
    l-Leaks: Membership Inference Attacks with Logits. (arXiv:2205.06469v1 [cs.LG])
    Machine Learning (ML) has made unprecedented progress in the past several decades. However, due to the memorability of the training data, ML is susceptible to various attacks, especially Membership Inference Attacks (MIAs), the objective of which is to infer the model's training data. So far, most membership inference attacks against ML classifiers leverage a shadow model with the same structure as the target model. However, empirical results show that these attacks can be easily mitigated if the shadow model is not clear about the network structure of the target model. In this paper, we present attacks based on black-box access to the target model. We name our attack l-Leaks. l-Leaks follows the intuition that if an established shadow model is similar enough to the target model, then the adversary can leverage the shadow model's information to predict a target sample's membership. The logits of the trained target model contain valuable sample knowledge. We build the shadow model by learning the logits of the target model, making the shadow model more similar to the target model. Then the shadow model will have sufficient confidence in the member samples of the target model. We also discuss the effect of the shadow model's different network structures on attack results. Experiments over different networks and datasets demonstrate that both of our attacks achieve strong performance.
    Uninorm-like parametric activation functions for human-understandable neural models. (arXiv:2205.06547v1 [cs.AI])
    We present a deep learning model for finding human-understandable connections between input features. Our approach uses a parameterized, differentiable activation function, based on the theoretical background of nilpotent fuzzy logic and multi-criteria decision-making (MCDM). The learnable parameter has a semantic meaning indicating the level of compensation between input features. The neural network determines the parameters using gradient descent to find human-understandable relationships between input features. We demonstrate the utility and effectiveness of the model by successfully applying it to classification problems from the UCI Machine Learning Repository.
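    A generic sketch in the same spirit (a Lukasiewicz-style clipped-sum operator, not necessarily the paper's exact construction): a single learnable parameter nu interpolates between AND-like and OR-like aggregation and can be read off after training as the level of compensation between inputs.

        import torch
        import torch.nn as nn

        class SoftAggregation(nn.Module):
            def __init__(self):
                super().__init__()
                self.nu = nn.Parameter(torch.tensor(0.5))   # interpretable parameter

            def forward(self, x):                # inputs assumed in [0, 1]
                n = x.size(-1)
                # nu = 0: fires only if all inputs are high (AND-like)
                # nu = 1: fires if any input is high (OR-like)
                shift = (1 - self.nu) * (n - 1)
                return torch.clamp(x.sum(-1) - shift, 0.0, 1.0)

        agg = SoftAggregation()
        print(agg(torch.tensor([[0.9, 0.8, 0.1]])))   # clamp(1.8 - 1.0) = 0.8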
    FRC-TOuNN: Topology Optimization of Continuous Fiber Reinforced Composites using Neural Network. (arXiv:2205.03737v1 [cs.CE] CROSS LISTED)
    In this paper, we present a topology optimization (TO) framework to simultaneously optimize the matrix topology and fiber distribution of functionally graded continuous fiber-reinforced composites (FRC). Current approaches in density-based TO for FRC use the underlying finite element mesh both for analysis and design representation. This poses several limitations while enforcing sub-element fiber spacing and generating high-resolution continuous fibers. In contrast, we propose a mesh-independent representation based on a neural network (NN) both to capture the matrix topology and fiber distribution. The implicit NN-based representation enables geometric and material queries at a higher resolution than a mesh discretization. This leads to the accurate extraction of functionally-graded continuous fibers. Further, by integrating the finite element simulations into the NN computational framework, we can leverage automatic differentiation for end-to-end automated sensitivity analysis, i.e., we no longer need to manually derive cumbersome sensitivity expressions. We demonstrate the effectiveness and computational efficiency of the proposed method through several numerical examples involving various objective functions. We also show that the optimized continuous fiber reinforced composites can be directly fabricated at high resolution using additive manufacturing.
    Modular Adaptive Policy Selection for Multi-Task Imitation Learning through Task Division. (arXiv:2203.14855v2 [cs.LG] UPDATED)
    Deep imitation learning requires many expert demonstrations, which can be hard to obtain, especially when many tasks are involved. However, different tasks often share similarities, so learning them jointly can greatly benefit them and alleviate the need for many demonstrations. Yet joint multi-task learning often suffers from negative transfer, sharing information that should be task-specific. In this work, we introduce a method to perform multi-task imitation while allowing for task-specific features. This is done by using proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared. The proto-policies operate in parallel and are adaptively chosen by a selector mechanism that is jointly trained with the modules. Experiments on different sets of tasks show that our method improves upon the accuracy of single agents, task-conditioned and multi-headed multi-task agents, as well as state-of-the-art meta learning agents. We also demonstrate its ability to autonomously divide the tasks into both shared and task-specific sub-behaviours.
    Exploiting Expert-guided Symmetry Detection in Offline Reinforcement Learning. (arXiv:2112.09943v2 [cs.LG] UPDATED)
    Offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task that greatly depends on the data available to the learning phase. Sometimes the dynamics of the model are invariant with respect to some transformations of the current state and action. Recent works showed that an expert-guided pipeline relying on Density Estimation methods, such as Deep Neural Network based Normalizing Flows, effectively detects this structure in deterministic environments, both categorical and continuous-valued. The acquired knowledge can be exploited to augment the original data set, leading eventually to a reduction in the distributional shift between the true and the learnt model. Such a data augmentation technique can be exploited as a preliminary process to be executed before the adoption of an Offline Reinforcement Learning architecture, increasing its performance. In this work we extend the paradigm to also tackle non-deterministic MDPs; in particular, 1) we propose a detection threshold in categorical environments based on statistical distances, and 2) we show that the former results lead to a performance improvement when solving the learnt MDP and then applying the optimal policy in the real environment.
    Rainbow Differential Privacy. (arXiv:2202.03974v2 [cs.CR] UPDATED)
    We extend a previous framework for designing differentially private (DP) mechanisms via randomized graph colorings, which was restricted to binary functions, corresponding to colorings in a graph, to multi-valued functions. As before, datasets are nodes in the graph and any two neighboring datasets are connected by an edge. In our setting, we assume that each dataset has a preferential ordering for the possible outputs of the mechanism, each of which we refer to as a rainbow. Different rainbows partition the graph of datasets into different regions. We show that if the DP mechanism is pre-specified at the boundary of such regions and behaves identically for all same-rainbow boundary datasets, at most one optimal such mechanism can exist and the problem can be solved by means of a morphism to a line graph. We then show closed-form expressions for the line graph in the case of ternary functions. The treatment of ternary queries in this paper displays enough richness to be extended to higher-dimensional query spaces with preferential query ordering, but the optimality proof does not seem to follow directly from the ternary proof.
    Improving Contextual Representation with Gloss Regularized Pre-training. (arXiv:2205.06603v1 [cs.CL])
    Though they achieve impressive results on many NLP tasks, BERT-like masked language models (MLM) encounter a discrepancy between pre-training and inference. In light of this gap, we investigate the contextual representation of pre-training and inference from the perspective of word probability distribution. We discover that BERT risks neglecting the contextual word similarity in pre-training. To tackle this issue, we propose an auxiliary gloss regularizer module for BERT pre-training (GR-BERT) to enhance word semantic similarity. By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, the word similarity can be explicitly modeled. We design two architectures for GR-BERT and evaluate our model in downstream tasks. Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation. GR-BERT achieves a new state of the art in the lexical substitution task and greatly promotes BERT sentence representation in both unsupervised and supervised STS tasks.
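    One plausible reading of the gloss regularizer is an auxiliary alignment term added to the usual MLM loss; the sketch below is an assumption-laden illustration (the cosine form, the names, and the weight `alpha` are all hypothetical, not the paper's exact objective):

    ```python
    # Hypothetical sketch of a gloss-regularized MLM objective: standard MLM
    # cross-entropy plus a term aligning masked-token contextual states with
    # precomputed embeddings of the words' dictionary glosses.
    import torch
    import torch.nn.functional as F

    def gloss_regularized_loss(mlm_logits, target_ids, token_states, gloss_embs,
                               mask, alpha=0.1):
        # mlm_logits: (B, T, V); token_states/gloss_embs: (B, T, H); mask: (B, T) bool
        mlm_loss = F.cross_entropy(mlm_logits[mask], target_ids[mask])
        # Pull each masked token's contextual state toward its gloss embedding.
        align = 1.0 - F.cosine_similarity(token_states[mask], gloss_embs[mask], dim=-1)
        return mlm_loss + alpha * align.mean()
    ```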
    AutoMat: Accelerated Computational Electrochemical systems Discovery. (arXiv:2011.04426v4 [cond-mat.mtrl-sci] UPDATED)
    Large-scale electrification is vital to addressing the climate crisis, but several scientific and technological challenges remain to fully electrify both the chemical industry and transportation. In both of these areas, new electrochemical materials will be critical, but their development currently relies heavily on human-time-intensive experimental trial and error and computationally expensive first-principles, meso-scale and continuum simulations. We present an automated workflow, AutoMat, that accelerates these computational steps by introducing both automated input generation and management of simulations across scales from first principles to continuum device modeling. Furthermore, we show how to seamlessly integrate multi-fidelity predictions such as machine learning surrogates or automated robotic experiments "in-the-loop". The automated framework is implemented with design space search techniques to dramatically accelerate the overall materials discovery pipeline by implicitly learning design features that optimize device performance across several metrics. We discuss the benefits of AutoMat using examples in electrocatalysis and energy storage and highlight lessons learned.
    Fast Conditional Network Compression Using Bayesian HyperNetworks. (arXiv:2205.06404v1 [cs.LG])
    We introduce a conditional compression problem and propose a fast framework for tackling it. The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts, e.g., a context involving only a subset of classes or a context where only limited compute resource is available. To solve this, we propose an efficient Bayesian framework to compress a given large network into a much smaller size tailored to meet each contextual requirement. We employ a hypernetwork to parameterize the posterior distribution of weights given conditional inputs and minimize a variational objective of this Bayesian neural network. To further reduce the network sizes, we propose a new input-output group sparsity factorization of weights to encourage more sparseness in the generated weights. Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.
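    The core mechanism, generating a target layer's weights from a conditional input, can be sketched as follows (a deterministic simplification: the paper parameterizes a posterior distribution over weights and adds a group-sparsity factorization, both omitted here):

    ```python
    # Hypothetical sketch: a hypernetwork maps a context embedding (e.g., which
    # subset of classes is needed) to the weights of a small target layer.
    import torch
    import torch.nn as nn

    class HyperLayer(nn.Module):
        def __init__(self, ctx_dim, in_dim, out_dim, hidden=128):
            super().__init__()
            self.in_dim, self.out_dim = in_dim, out_dim
            self.hyper = nn.Sequential(
                nn.Linear(ctx_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, in_dim * out_dim + out_dim),
            )

        def forward(self, x, ctx):
            params = self.hyper(ctx)   # weights generated from the context
            W = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
            b = params[self.in_dim * self.out_dim:]
            return x @ W.t() + b

    layer = HyperLayer(ctx_dim=8, in_dim=32, out_dim=10)
    out = layer(torch.randn(4, 32), torch.randn(8))   # one context vector
    ```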
    Verifiable and Compositional Reinforcement Learning Systems. (arXiv:2106.05864v3 [cs.LG] UPDATED)
    We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process (pMDP), which is used to plan and analyze compositions of subsystems, and of the collection of low-level subsystems themselves. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e., achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems; if they each learn a policy satisfying the appropriate subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the pMDP, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications, and for training the subsystems to meet them. As an additional benefit, this procedure allows for particularly challenging or important components of an overall task to be determined automatically, and focused on, during training. Experimental results demonstrate the presented framework's novel capabilities.
    Collaborative Drug Discovery: Inference-level Data Protection Perspective. (arXiv:2205.06506v1 [cs.CR])
    The pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data; hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborative modeling in the preclinical phase of drug discovery to accelerate the selection of promising drug candidates. After a short taxonomy of state-of-the-art inference attacks, we adopt and customize several of them to the underlying scenario. Finally, we describe and experiment with a handful of relevant privacy protection techniques to mitigate such attacks.
    Computing Multiple Image Reconstructions with a Single Hypernetwork. (arXiv:2202.11009v3 [cs.CV] UPDATED)
    Deep learning based techniques achieve state-of-the-art results in a wide range of image reconstruction tasks like compressed sensing. These methods almost always have hyperparameters, such as the weight coefficients that balance the different terms in the optimized loss function. The typical approach is to train the model for a hyperparameter setting determined with some empirical or theoretical justification. Thus, at inference time, the model can only compute reconstructions corresponding to the pre-determined hyperparameter values. In this work, we present a hypernetwork-based approach, called HyperRecon, to train reconstruction models that are agnostic to hyperparameter settings. At inference time, HyperRecon can efficiently produce diverse reconstructions, which would each correspond to different hyperparameter values. In this framework, the user is empowered to select the most useful output(s) based on their own judgement. We demonstrate our method in compressed sensing, super-resolution and denoising tasks, using two large-scale and publicly-available MRI datasets. Our code is available at https://github.com/alanqrwang/hyperrecon.
    Autonomous Navigation and Configuration of Integrated Access Backhauling for UAV Base Station Using Reinforcement Learning. (arXiv:2112.07313v2 [cs.LG] UPDATED)
    Fast and reliable connectivity is essential to enhancing situational awareness and operational efficiency for public safety mission-critical (MC) users. In emergency or disaster circumstances, where existing cellular network coverage and capacity may not be available to meet MC communication demands, deployable-network-based solutions such as cells-on-wheels/wings can be utilized swiftly to ensure reliable connection for MC users. In this paper, we consider a scenario where a macro base station (BS) is destroyed due to a natural disaster and an unmanned aerial vehicle carrying a BS (UAV-BS) is set up to provide temporary coverage for users in the disaster area. The UAV-BS is integrated into the mobile network using the 5G integrated access and backhaul (IAB) technology. We propose a framework and signalling procedure for applying machine learning to this use case. A deep reinforcement learning algorithm is designed to jointly optimize the access and backhaul antenna tilt as well as the three-dimensional location of the UAV-BS in order to best serve the on-ground MC users while maintaining a good backhaul connection. Our results show that the proposed algorithm can autonomously navigate and configure the UAV-BS to improve the throughput and reduce the drop rate of MC users.
    CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. (arXiv:2201.01018v2 [q-bio.GN] UPDATED)
    Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages. Thus, there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer amount of phages sequenced by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43\% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein and DNA-based sequence features. Our implementation, named CHERRY, can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with 11 popular host prediction methods. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37\%. In addition, CHERRY's performance on short contigs is more stable than that of other tools.
    A Machine Learning Analysis of Impact of the Covid-19 Pandemic on Alcohol Consumption Habit Changes Among Healthcare Workers in the U.S. (arXiv:2112.06261v2 [cs.LG] UPDATED)
    In this paper, we discuss the impact of the Covid-19 pandemic on alcohol consumption habit changes among healthcare workers in the United States. We utilize multiple supervised and unsupervised machine learning methods and models such as Decision Trees, Logistic Regression, Naive Bayes classifier, k-Nearest Neighbors, Support Vector Machines, Multilayer perceptron, Random Forests, XGBoost, CatBoost, LightGBM, Synthetic Minority Oversampling, Chi-Squared Test and mutual information method on mental health survey data obtained from the University of Michigan Inter-University Consortium for Political and Social Research to find out relationships between COVID-19-related negative effects and alcohol consumption habit changes among healthcare workers. Our findings suggest that COVID-19-related school closures, COVID-19-related work schedule changes and COVID-19-related news exposure may lead to an increase in alcohol use among healthcare workers in the United States.
    ARCADE: Adversarially Regularized Convolutional Autoencoder for Network Anomaly Detection. (arXiv:2205.01432v2 [cs.LG] UPDATED)
    As the number of heterogeneous IP-connected devices and traffic volume increase, so does the potential for security breaches. The undetected exploitation of these breaches can bring severe cybersecurity and privacy risks. In this paper, we present a practical unsupervised anomaly-based deep learning detection system called ARCADE (Adversarially Regularized Convolutional Autoencoder for unsupervised network anomaly DEtection). ARCADE exploits the properties of 1D Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) to automatically build a profile of the normal traffic based on a subset of raw bytes of a few initial packets of network flows, so that potential network anomalies and intrusions can be effectively detected before they can cause further damage to the network. A convolutional Autoencoder (AE) is proposed that suits online detection in resource-constrained environments and can be easily improved for environments with higher computational capabilities. An adversarial training strategy is proposed to regularize and decrease the AE's capability to reconstruct network flows that are out of the normal distribution, and thereby improve its anomaly detection capabilities. The proposed approach is more effective than existing state-of-the-art deep learning approaches for network anomaly detection and significantly reduces detection time. The evaluation results show that the proposed approach is suitable for anomaly detection on resource-constrained hardware platforms such as the Raspberry Pi.
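    The underlying detection principle, scoring flows by the reconstruction error of an autoencoder trained on normal traffic, can be sketched as below (architecture, sizes, and the threshold rule are illustrative assumptions; ARCADE additionally applies adversarial regularization, omitted here):

    ```python
    # Hypothetical sketch of AE-based network anomaly detection: flows the
    # model reconstructs poorly are flagged as anomalous.
    import torch
    import torch.nn as nn

    class ConvAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv1d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.ReLU())
            self.dec = nn.Sequential(
                nn.ConvTranspose1d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose1d(16, 1, 4, stride=2, padding=1))

        def forward(self, x):
            return self.dec(self.enc(x))

    model = ConvAE()                        # trained on normal traffic only
    flows = torch.rand(8, 1, 256)           # normalized raw bytes of initial packets
    scores = ((model(flows) - flows) ** 2).mean(dim=(1, 2))
    anomalous = scores > scores.mean() + 3 * scores.std()   # simple threshold
    ```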
    Enhanced Bilevel Optimization via Bregman Distance. (arXiv:2107.12301v2 [math.OC] UPDATED)
    Bilevel optimization has been widely applied to many machine learning problems such as hyperparameter optimization, policy optimization and meta-learning. Although many bilevel optimization methods have recently been proposed to solve bilevel optimization problems, they still suffer from high computational complexities and do not consider the more general bilevel problems with nonsmooth regularization. In this paper, we thus propose a class of enhanced bilevel optimization methods by using Bregman distance to solve bilevel optimization problems, where the outer subproblem is nonconvex and possibly nonsmooth, and the inner subproblem is strongly convex. Specifically, we propose a bilevel optimization method based on Bregman distance (BiO-BreD) for solving deterministic bilevel problems, which reaches a lower computational complexity than the best known results. Meanwhile, we also propose a stochastic bilevel optimization method (SBiO-BreD) to solve stochastic bilevel problems based on stochastic approximated gradients and Bregman distance. Moreover, we further propose an accelerated version of the SBiO-BreD method (ASBiO-BreD) by using the variance-reduction technique, which achieves a lower computational complexity than the best known computational complexity with respect to the condition number $\kappa$ and target accuracy $\epsilon$ for finding an $\epsilon$-stationary point. We employ the data hyper-cleaning task to demonstrate that our algorithms outperform existing bilevel algorithms.
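    For concreteness, the problem class described here can be written schematically as follows (notation assumed, not verbatim from the paper: smooth nonconvex $f$, nonsmooth regularizer $h$, and $g(x,\cdot)$ strongly convex):

    ```latex
    \min_{x \in \mathbb{R}^d} \; F(x) = f\bigl(x, y^*(x)\bigr) + h(x)
    \quad \text{s.t.} \quad
    y^*(x) = \operatorname*{arg\,min}_{y \in \mathbb{R}^p} \; g(x, y)
    ```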
    Anomaly Detection using Principles of Human Perception. (arXiv:2103.12323v4 [cs.CR] UPDATED)
    In the fields of statistics and unsupervised machine learning, a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random-variable modelling, anomalies are found directly in a set of data by assuming a uniform and independent random distribution of the constituent elements of the observations; anomalies correspond to those observations where the expected number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception, an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data, with promising results on the detection of global anomalies in multivariate data.
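    A toy instance of the stated rule, flagging an observation when the expected number of occurrences of its elements under a uniform-and-independent null model falls below one, might look like this (the binomial null model and the number-of-tests convention are illustrative assumptions, not the paper's exact estimator):

    ```python
    from math import comb

    def binomial_tail(n, k, p):
        """P[X >= k] for X ~ Binomial(n, p): the chance that a grouping
        contains at least k of its elements under the null model."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j)
                   for j in range(k, n + 1))

    def expected_occurrences(num_tests, n, k, p):
        # Expected number of groupings at least this unusual across all tested
        # views (a simplified number-of-false-alarms count).
        return num_tests * binomial_tail(n, k, p)

    # Flag an observation as anomalous when the expectation drops below one.
    is_anomaly = expected_occurrences(num_tests=1000, n=100, k=30, p=0.1) < 1
    ```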
    Möbius Convolutions for Spherical CNNs. (arXiv:2201.12212v2 [cs.CV] UPDATED)
    Möbius transformations play an important role in both geometry and spherical image processing - they are the group of conformal automorphisms of 2D surfaces and the spherical equivalent of homographies. Here we present a novel, Möbius-equivariant spherical convolution operator which we call Möbius convolution, and with it, develop the foundations for Möbius-equivariant spherical CNNs. Our approach is based on a simple observation: to achieve equivariance, we only need to consider the lower-dimensional subgroup which transforms the positions of points as seen in the frames of their neighbors. To efficiently compute Möbius convolutions at scale we derive an approximation of the action of the transformations on spherical filters, allowing us to compute our convolutions in the spectral domain with the fast Spherical Harmonic Transform. The resulting framework is both flexible and descriptive, and we demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v8 [cs.LG] UPDATED)
    In this paper, we propose a novel neural exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combined it with TS or UCB strategies for exploration. However, this line of work explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB are adapted for exploration. Instead of calculating a large-deviation-based statistical bound for exploration as in previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn potential gains compared to the currently estimated reward for exploration. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.
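    Schematically, the decision rule combines two learned networks; in the paper the exploration network takes gradient-based features of the exploitation network as input, so the raw-context version below is a simplification with illustrative names and sizes:

    ```python
    # Hypothetical sketch of the EE-Net decision rule: an exploitation network
    # estimates the reward and an exploration network estimates the potential
    # gain over that estimate; an additive decision-maker scores each arm.
    import torch
    import torch.nn as nn

    def mlp(inp, out, hidden=64):
        return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                             nn.Linear(hidden, out))

    exploit = mlp(16, 1)   # trained to predict observed rewards r
    explore = mlp(16, 1)   # trained to predict residuals (r - r_hat)

    def score(arms):                        # arms: (num_arms, context_dim)
        with torch.no_grad():
            return exploit(arms) + explore(arms)   # additive decision-maker

    arms = torch.randn(10, 16)
    chosen = int(torch.argmax(score(arms)))
    ```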
    Self-Supervised Learning for Domain Adaptation on Point-Clouds. (arXiv:2003.12641v5 [cs.CV] UPDATED)
    Self-supervised learning (SSL) is a technique for learning useful representations from unlabeled data. It has been applied effectively to domain adaptation (DA) on images and videos. It is still unknown if and how it can be leveraged for domain adaptation in 3D perception problems. Here we describe the first study of SSL for DA on point clouds. We introduce a new family of pretext tasks, Deformation Reconstruction, inspired by the deformations encountered in sim-to-real transformations. In addition, we propose a novel training procedure for labeled point cloud data motivated by the MixUp method, called Point cloud Mixup (PCM). Evaluations on domain adaptation datasets for classification and segmentation demonstrate a large improvement over existing and baseline methods.
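    A MixUp-style operation for labelled point clouds along the lines described might be sketched as follows (the exact PCM construction in the paper may differ; equally sized clouds are assumed here):

    ```python
    # Hypothetical sketch of point cloud mixup: take proportional subsets of
    # points from two clouds and mix their labels with the same ratio.
    import numpy as np

    def point_cloud_mixup(pc_a, pc_b, y_a, y_b, alpha=1.0):
        """pc_*: (N, 3) arrays of equal size; y_*: one-hot label vectors."""
        lam = np.random.beta(alpha, alpha)
        n = pc_a.shape[0]
        k = int(lam * n)
        idx_a = np.random.choice(pc_a.shape[0], k, replace=False)
        idx_b = np.random.choice(pc_b.shape[0], n - k, replace=False)
        mixed = np.concatenate([pc_a[idx_a], pc_b[idx_b]], axis=0)
        return mixed, lam * y_a + (1 - lam) * y_b
    ```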
    Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. (arXiv:2205.06811v1 [cs.LG])
    We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
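    The weighted ridge regression at the core of the algorithm has the following schematic form, where the weight $w_s$ of a chosen action is reduced once its confidence term exceeds a threshold (the precise weight rule is specified in the paper):

    ```latex
    \hat{\theta}_t = \operatorname*{arg\,min}_{\theta} \;
    \sum_{s=1}^{t-1} w_s \bigl( r_s - \langle \theta, x_s \rangle \bigr)^2
    + \lambda \lVert \theta \rVert_2^2
    ```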
    Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime. (arXiv:2205.06798v1 [cs.LG])
    The generalization performance of kernel ridge regression (KRR) exhibits a multi-phased pattern that crucially depends on the scaling relationship between the sample size $n$ and the underlying dimension $d$. This phenomenon is due to the fact that KRR sequentially learns functions of increasing complexity as the sample size increases; when $d^{k-1}\ll n\ll d^{k}$, only polynomials with degree less than $k$ are learned. In this paper, we present sharp asymptotic characterization of the performance of KRR at the critical transition regions with $n \asymp d^k$, for $k\in\mathbb{Z}^{+}$. Our asymptotic characterization provides a precise picture of the whole learning process and clarifies the impact of various parameters (including the choice of the kernel function) on the generalization performance. In particular, we show that the learning curves of KRR can have a delicate "double descent" behavior due to specific bias-variance trade-offs at different polynomial scaling regimes.
    A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities. (arXiv:2205.06743v1 [cs.LG])
    Few-shot learning (FSL) has emerged as an effective learning method and shows great potential. Despite the recent creative works in tackling FSL tasks, learning valid information rapidly from just a few or even zero samples remains a serious challenge. In this context, we extensively investigated more than 200 of the latest papers on FSL published in the past three years, aiming to present a timely and comprehensive overview of the most recent advances in FSL along with impartial comparisons of the strengths and weaknesses of the existing works. For the sake of avoiding conceptual confusion, we first elaborate and compare a set of similar concepts including few-shot learning, transfer learning, and meta-learning. Furthermore, we propose a novel taxonomy to classify the existing work according to the level of abstraction of knowledge in accordance with the challenges of FSL. To enrich this survey, in each subsection we provide in-depth analysis and insightful discussion about recent advances on these topics. Moreover, taking computer vision as an example, we highlight the important application of FSL, covering various research hotspots. Finally, we conclude the survey with unique insights into the technology evolution trends together with potential future research opportunities in the hope of providing guidance to follow-up research.
    Differentiable Graph Module (DGM) for Graph Convolutional Networks. (arXiv:2002.04999v4 [cs.LG] UPDATED)
    Graph deep learning has recently emerged as a powerful ML concept that allows successful deep neural architectures to be generalized to non-Euclidean structured data. Such methods have shown promising results on a broad spectrum of applications ranging from social science, biomedicine, and particle physics to computer vision, graphics, and chemistry. One of the limitations of the majority of current graph neural network architectures is that they are often restricted to the transductive setting and rely on the assumption that the underlying graph is {\em known} and {\em fixed}. Often, this assumption is not true since the graph may be noisy, or partially and even completely unknown. In such cases, it would be helpful to infer the graph directly from the data, especially in inductive settings where some nodes were not present in the graph at training time. Furthermore, learning a graph may become an end in itself, as the inferred structure may provide complementary insights alongside the downstream task. In this paper, we introduce the Differentiable Graph Module (DGM), a learnable function that predicts edge probabilities in the graph which are optimal for the downstream task. DGM can be combined with convolutional graph neural network layers and trained in an end-to-end fashion. We provide an extensive evaluation of applications from the domains of healthcare (disease prediction), brain imaging (age prediction), computer graphics (3D point cloud segmentation), and computer vision (zero-shot learning). We show that our model provides a significant improvement over baselines both in transductive and inductive settings and achieves state-of-the-art results.
    Productivity Assessment of Neural Code Completion. (arXiv:2205.06537v1 [cs.SE])
    Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers' productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate at which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers' perception of productivity.
    Latent-Graph Learning for Disease Prediction. (arXiv:2003.13620v2 [cs.LG] UPDATED)
    Recently, Graph Convolutional Networks (GCNs) have proven to be a powerful machine learning tool for Computer-Aided Diagnosis (CADx) and disease prediction. A key component in these models is to build a population graph, where the graph adjacency matrix represents pair-wise patient similarities. Until now, the similarity metrics have been defined manually, usually based on meta-features like demographics or clinical scores. The definition of the metric, however, needs careful tuning, as GCNs are very sensitive to the graph structure. In this paper, we demonstrate for the first time in the CADx domain that it is possible to learn a single, optimal graph towards the GCN's downstream task of disease classification. To this end, we propose a novel, end-to-end trainable graph learning architecture for dynamic and localized graph pruning. Unlike commonly employed spectral GCN approaches, our GCN is spatial and inductive, and can thus infer previously unseen patients as well. We demonstrate significant classification improvements with our learned graph on two CADx problems in medicine. We further explain and visualize this result using an artificial dataset, underlining the importance of graph learning for more accurate and robust inference with GCNs in medical applications.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v1 [stat.ML])
    The rapid progress of machine learning interatomic potentials over the past couple of years produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom-density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message-passing neural network with equivariant features that showed state-of-the-art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v1 [stat.ML])
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
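    The quantity GWI's criterion builds on, the 2-Wasserstein distance between Gaussian measures, has a closed form; for $\mathcal{N}(m_1, C_1)$ and $\mathcal{N}(m_2, C_2)$ with trace-class covariance operators it reads:

    ```latex
    W_2^2 = \lVert m_1 - m_2 \rVert^2
          + \operatorname{tr}\!\Bigl( C_1 + C_2
          - 2 \bigl( C_1^{1/2} C_2 \, C_1^{1/2} \bigr)^{1/2} \Bigr)
    ```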
    The ACM Multimedia 2022 Computational Paralinguistics Challenge: Vocalisations, Stuttering, Activity, & Mosquitoes. (arXiv:2205.06799v1 [cs.SD])
    The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the Vocalisations and Stuttering Sub-Challenges, a classification on human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audio human activity recognition from smartwatch sensor data; and in the Mosquitoes Sub-Challenge, mosquitoes need to be detected. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the usual ComParE and BoAW features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, we add end-to-end sequential modelling, and a log-mel-128-BNN.
    Toward A Formalized Approach for Spike Sorting Algorithms and Hardware Evaluation. (arXiv:2205.06514v1 [cs.LG])
    Spike sorting algorithms are used to separate extracellular recordings of neuronal populations into single-unit spike activities. The development of customized hardware implementing spike sorting algorithms is burgeoning. However, there is a lack of a systematic approach and a set of standardized evaluation criteria to facilitate direct comparison of both software and hardware implementations. In this paper, we formalize a set of standardized criteria and a publicly available synthetic dataset entitled Synthetic Simulations Of Extracellular Recordings (SSOER), which was constructed by aggregating existing synthetic datasets with varying Signal-To-Noise Ratios (SNRs). Furthermore, we present a benchmark for future comparison, and use our criteria to evaluate a simulated Resistive Random-Access Memory (RRAM) In-Memory Computing (IMC) system using the Discrete Wavelet Transform (DWT) for feature extraction. Our system consumes approximately 10.72 mW per channel and occupies an area of 0.66 mm$^2$ in a 22nm FDSOI Complementary Metal-Oxide-Semiconductor (CMOS) process.
    NN-EUCLID: deep-learning hyperelasticity without stress data. (arXiv:2205.06664v1 [cs.LG])
    We propose a new approach for unsupervised learning of hyperelastic constitutive laws with physics-consistent deep neural networks. In contrast to supervised learning, which assumes the availability of stress-strain pairs, the approach only uses realistically measurable full-field displacement and global reaction force data; thus it lies within the scope of our recent framework for Efficient Unsupervised Constitutive Law Identification and Discovery (EUCLID), and we denote it as NN-EUCLID. The absence of stress labels is compensated for by leveraging a physics-motivated loss function based on the conservation of linear momentum to guide the learning process. The constitutive model is based on input-convex neural networks, which are capable of learning a function that is convex with respect to its inputs. By employing a specially designed neural network architecture, multiple physical and thermodynamic constraints for hyperelastic constitutive laws, such as material frame indifference, (poly-)convexity, and a stress-free reference configuration, are automatically satisfied. We demonstrate the ability of the approach to accurately learn several hidden isotropic and anisotropic hyperelastic constitutive laws - including, e.g., Mooney-Rivlin, Arruda-Boyce, Ogden, and Holzapfel models - without using stress data. For anisotropic hyperelasticity, the unknown anisotropic fiber directions are automatically discovered jointly with the constitutive model. The neural network-based constitutive models show good generalization capability beyond the strain states observed during training and are readily deployable in a general finite element framework for simulating complex mechanical boundary value problems with good accuracy.
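    The convexity-by-construction ingredient can be sketched as an input-convex network: non-negative hidden-to-hidden weights and convex, non-decreasing activations make the output convex in the input (a generic ICNN sketch, not NN-EUCLID's exact architecture):

    ```python
    # Hypothetical sketch of an input-convex neural network (ICNN).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ICNN(nn.Module):
        def __init__(self, in_dim, hidden=32, depth=2):
            super().__init__()
            self.Wx = nn.ModuleList([nn.Linear(in_dim, hidden)
                                     for _ in range(depth + 1)])
            self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)
                                     for _ in range(depth)])
            self.out = nn.Linear(hidden, 1, bias=False)

        def forward(self, x):
            z = F.softplus(self.Wx[0](x))
            for wx, wz in zip(self.Wx[1:], self.Wz):
                # clamping keeps hidden-to-hidden weights non-negative,
                # which preserves convexity of z in x
                z = F.softplus(wx(x) + F.linear(z, wz.weight.clamp(min=0)))
            return F.linear(z, self.out.weight.clamp(min=0))

    energy = ICNN(in_dim=3)            # e.g., three strain invariants
    W = energy(torch.rand(5, 3))       # convex strain-energy values, shape (5, 1)
    ```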
    PoisonedEncoder: Poisoning the Unlabeled Pre-training Data in Contrastive Learning. (arXiv:2205.06401v1 [cs.CR])
    Contrastive learning pre-trains an image encoder using a large amount of unlabeled data such that the image encoder can be used as a general-purpose feature extractor for various downstream tasks. In this work, we propose PoisonedEncoder, a data poisoning attack to contrastive learning. In particular, an attacker injects carefully crafted poisoning inputs into the unlabeled pre-training data, such that the downstream classifiers built based on the poisoned encoder for multiple target downstream tasks simultaneously classify attacker-chosen, arbitrary clean inputs as attacker-chosen, arbitrary classes. We formulate our data poisoning attack as a bilevel optimization problem, whose solution is the set of poisoning inputs; and we propose a contrastive-learning-tailored method to approximately solve it. Our evaluation on multiple datasets shows that PoisonedEncoder achieves high attack success rates while maintaining the testing accuracy of the downstream classifiers built upon the poisoned encoder for non-attacker-chosen inputs. We also evaluate five defenses against PoisonedEncoder, including one pre-processing, three in-processing, and one post-processing defenses. Our results show that these defenses can decrease the attack success rate of PoisonedEncoder, but they also sacrifice the utility of the encoder or require a large clean pre-training dataset.
    OFedQIT: Communication-Efficient Online Federated Learning via Quantization and Intermittent Transmission. (arXiv:2205.06491v1 [cs.LG])
    Online federated learning (OFL) is a promising framework to collaboratively learn a sequence of non-linear functions (or models) from distributed streaming data incoming to multiple clients while keeping the privacy of their local data. In this framework, we first construct a vanilla method (named OFedAvg) by incorporating online gradient descent (OGD) into the de facto aggregation method (named FedAvg). Despite its optimal asymptotic performance, OFedAvg suffers from heavy communication overhead and long learning delay. To tackle these shortcomings, we propose a communication-efficient OFL algorithm (named OFedQIT) by means of a stochastic quantization and an intermittent transmission. Our major contribution is to theoretically prove that OFedQIT over $T$ time slots can achieve an optimal sublinear regret bound $\mathcal{O}(\sqrt{T})$ for any real data (including non-IID data) while significantly reducing the communication overhead. Furthermore, this optimality is still guaranteed even when only a small fraction of clients (having faster processing times and high-quality communication channels) in a network participate at once. Our analysis reveals that OFedQIT successfully addresses the drawbacks of OFedAvg while maintaining superior learning accuracy. Experiments with real datasets demonstrate the effectiveness of our algorithm on various online classification and regression tasks.
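    The quantization half of the recipe is typically an unbiased stochastic quantizer; a generic sketch follows (OFedQIT's exact scheme and its intermittent-transmission schedule are not reproduced here):

    ```python
    # Hypothetical sketch of unbiased stochastic quantization of a gradient
    # vector to a small number of levels, a standard building block for
    # communication-efficient federated updates.
    import numpy as np

    def stochastic_quantize(v, levels=8):
        scale = np.max(np.abs(v))
        if scale == 0:
            return v
        normalized = np.abs(v) / scale * (levels - 1)
        lower = np.floor(normalized)
        # round up with probability equal to the fractional part -> unbiased
        rounded = lower + (np.random.rand(*v.shape) < (normalized - lower))
        return np.sign(v) * rounded / (levels - 1) * scale

    g = np.random.randn(1000)
    q = stochastic_quantize(g)   # E[q] = g, but q needs only ~log2(levels) bits
    ```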
    Convergence Analysis of Deep Residual Networks. (arXiv:2205.06571v1 [cs.LG])
    Various powerful deep neural network architectures have made great contributions to the exciting successes of deep learning in the past two decades. Among them, deep Residual Networks (ResNets) are of particular importance because they demonstrated great usefulness in computer vision by winning first place in many deep learning competitions. Also, ResNets were the first class of neural networks in the development history of deep learning that were truly deep. It is of mathematical interest and practical significance to understand the convergence of deep ResNets. We aim at characterizing the convergence of deep ResNets as the depth tends to infinity in terms of the parameters of the networks. Toward this purpose, we first give a matrix-vector description of general deep neural networks with shortcut connections and formulate an explicit expression for the networks by using the notions of activation domains and activation matrices. The convergence is then reduced to the convergence of two series involving infinite products of non-square matrices. By studying the two series, we establish a sufficient condition for pointwise convergence of ResNets. Our result is able to give justification for the design of ResNets. We also conduct experiments on benchmark machine learning data to verify our results.
    Convergence of Deep Neural Networks with General Activation Functions and Pooling. (arXiv:2205.06570v1 [cs.LG])
    Deep neural networks, as a powerful system to represent high-dimensional complex functions, play a key role in deep learning. Convergence of deep neural networks is a fundamental issue in building the mathematical foundation for deep learning. We investigated the convergence of deep ReLU networks and deep convolutional neural networks in two recent works (arXiv:2107.12530, arXiv:2109.13542). Only the Rectified Linear Unit (ReLU) activation was studied therein, and the important pooling strategy was not considered. In the current work, we study the convergence of deep neural networks as the depth tends to infinity for two other important activation functions: the leaky ReLU and the sigmoid function. Pooling is also studied. As a result, we prove that the sufficient condition established in arXiv:2107.12530 and arXiv:2109.13542 is still sufficient for leaky ReLU networks. For contractive activation functions such as the sigmoid function, we establish a weaker sufficient condition for uniform convergence of deep neural networks.
    Test-time Fourier Style Calibration for Domain Generalization. (arXiv:2205.06427v1 [cs.CV])
    The topic of generalizing machine learning models learned on a collection of source domains to unknown target domains is challenging. While many domain generalization (DG) methods have achieved promising results, they primarily rely on the source domains at train-time without manipulating the target domains at test-time. Thus, it is still possible that those methods can overfit to source domains and perform poorly on target domains. Driven by the observation that domains are strongly related to styles, we argue that reducing the gap between source and target styles can boost models' generalizability. To solve the dilemma of having no access to the target domain during training, we introduce Test-time Fourier Style Calibration (TF-Cal) for calibrating the target domain style on the fly during testing. To access styles, we utilize Fourier transformation to decompose features into amplitude (style) features and phase (semantic) features. Furthermore, we present an effective technique to Augment Amplitude Features (AAF) to complement TF-Cal. Extensive experiments on several popular DG benchmarks and a segmentation dataset for medical images demonstrate that our method outperforms state-of-the-art methods.
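    The amplitude/phase split at the heart of the method is a standard Fourier decomposition; a simplified single-channel sketch follows (the reference spectrum and the mixing rule are illustrative assumptions, not TF-Cal's exact calibration):

    ```python
    # Hypothetical sketch of Fourier-based style manipulation: decompose a
    # feature map into amplitude (style) and phase (semantics), then nudge
    # the amplitude toward a reference spectrum at test time.
    import numpy as np

    def calibrate_style(feat, ref_amplitude, lam=0.5):
        """feat: (H, W) feature map; ref_amplitude: (H, W) reference spectrum."""
        spectrum = np.fft.fft2(feat)
        amplitude, phase = np.abs(spectrum), np.angle(spectrum)
        mixed = (1 - lam) * amplitude + lam * ref_amplitude  # style calibration
        return np.real(np.fft.ifft2(mixed * np.exp(1j * phase)))

    feat = np.random.rand(32, 32)
    ref = np.abs(np.fft.fft2(np.random.rand(32, 32)))  # e.g., source-domain average
    calibrated = calibrate_style(feat, ref)
    ```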
    Deep Learning for Prawn Farming: Forecasting and Anomaly Detection. (arXiv:2205.06359v1 [cs.LG])
    We present a decision support system for managing water quality in prawn ponds. The system uses various sources of data and deep learning models in a novel way to provide 24-hour forecasting and anomaly detection of water quality parameters. It provides prawn farmers with tools to proactively avoid a poor growing environment, thereby optimising growth and reducing the risk of losing stock. This is a major shift for farmers who are forced to manage ponds by reactively correcting poor water quality conditions. To our knowledge, we are the first to apply Transformer as an anomaly detection model, and the first to apply anomaly detection in general to this aquaculture problem. Our technical contributions include adapting ForecastNet for multivariate data and adapting Transformer and the Attention model to incorporate weather forecast data into their decoders. We attain an average mean absolute percentage error of 12% for dissolved oxygen forecasts and we demonstrate two anomaly detection case studies. The system is successfully running in its second year of deployment on a commercial prawn farm.
    StyLandGAN: A StyleGAN based Landscape Image Synthesis using Depth-map. (arXiv:2205.06611v1 [cs.CV])
    Despite recent success in conditional image synthesis, prevalent input conditions such as semantics and edges are not clear enough to express `Linear (Ridges)' and `Planar (Scale)' representations. To address this problem, we propose a novel framework, StyLandGAN, which synthesizes desired landscape images using a depth map, which has higher expressive power. Our StyLandGAN is extended from the unconditional generation model to accept input conditions. We also propose a '2-phase inference' pipeline which generates diverse depth maps and shifts local parts so that it can easily reflect the user's intent. As a comparison, we modified the existing semantic image synthesis models to accept a depth map as well. Experimental results show that our method is superior to existing methods in quality, diversity, and depth-accuracy.
    KASAM: Spline Additive Models for Function Approximation. (arXiv:2205.06376v1 [cs.LG])
    Neural networks have been criticised for their inability to perform continual learning due to catastrophic forgetting and rapid unlearning of a past concept when a new concept is introduced. Catastrophic forgetting can be alleviated by specifically designed models and training techniques. This paper outlines a novel Spline Additive Model (SAM). SAM exhibits intrinsic memory retention with sufficient expressive power for many practical tasks, but is not a universal function approximator. SAM is extended with the Kolmogorov-Arnold representation theorem to a novel universal function approximator, called the Kolmogorov-Arnold Spline Additive Model - KASAM. The memory retention, expressive power and limitations of SAM and KASAM are illustrated analytically and empirically. SAM exhibited robust but imperfect memory retention, with small regions of overlapping interference in sequential learning tasks. KASAM exhibited greater susceptibility to catastrophic forgetting. KASAM in combination with pseudo-rehearsal training techniques exhibited superior performance in regression tasks and memory retention.
    Interpretable Climate Change Modeling With Progressive Cascade Networks. (arXiv:2205.06351v1 [cs.LG])
    Typical deep learning approaches to modeling high-dimensional data often result in complex models that do not easily reveal a new understanding of the data. Research in the deep learning field is very actively pursuing new methods to interpret deep neural networks and to reduce their complexity. An approach is described here that starts with linear models and incrementally adds complexity only as supported by the data. An application is shown in which models that map global temperature and precipitation to years are trained to investigate patterns associated with changes in climate.
    A hybrid data driven-physics constrained Gaussian process regression framework with deep kernel for uncertainty quantification. (arXiv:2205.06494v1 [cs.LG])
    Gaussian process regression (GPR) has been a well-known machine learning method for various applications such as uncertainty quantification (UQ). However, GPR is inherently a data-driven method, which requires a sufficiently large dataset. If appropriate physics constraints (e.g., expressed in partial differential equations) can be incorporated, the amount of data can be greatly reduced and the accuracy further improved. In this work, we propose a hybrid data-driven, physics-constrained Gaussian process regression framework. We encode the physics knowledge with the Boltzmann-Gibbs distribution and derive our model through the maximum likelihood (ML) approach. We apply the deep kernel learning method. The proposed model learns from both data and physics constraints through the training of a deep neural network, which serves as part of the covariance function in GPR. The proposed model achieves good results in high-dimensional problems, and correctly propagates the uncertainty, with very limited labelled data provided.
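    The deep-kernel ingredient can be sketched as a neural feature map composed with a standard RBF kernel (the physics-constrained likelihood from the Boltzmann-Gibbs encoding is omitted; names and sizes are illustrative):

    ```python
    # Hypothetical sketch of a deep kernel: a neural network warps the inputs
    # before an RBF kernel, and GPR proceeds on the warped features.
    import torch
    import torch.nn as nn

    embed = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 8))

    def deep_rbf_kernel(X1, X2, lengthscale=1.0):
        Z1, Z2 = embed(X1), embed(X2)
        d2 = torch.cdist(Z1, Z2) ** 2
        return torch.exp(-0.5 * d2 / lengthscale**2)

    X = torch.randn(50, 4); y = torch.randn(50, 1)
    K = deep_rbf_kernel(X, X) + 1e-4 * torch.eye(50)   # jitter for stability
    # GP posterior mean at test points Xs: k(Xs, X) @ K^{-1} @ y
    Xs = torch.randn(5, 4)
    mean = deep_rbf_kernel(Xs, X) @ torch.linalg.solve(K, y)
    ```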
    Warm-starting DARTS using meta-learning. (arXiv:2205.06355v1 [cs.LG])
    Neural architecture search (NAS) has shown great promise in the field of automated machine learning (AutoML). NAS has outperformed hand-designed networks and made a significant step forward in the field of automating the design of deep neural networks, thus further reducing the need for human expertise. However, most research is done targeting a single specific task, leaving research of NAS methods over multiple tasks mostly overlooked. Generally, there exist two popular ways to find an architecture for a novel task: searching from scratch, which is inefficient by design, or transferring discovered architectures from other tasks, which provides no performance guarantees and is probably not optimal. In this work, we present a meta-learning framework to warm-start Differentiable architecture search (DARTS). DARTS is a NAS method that can be initialized with a transferred architecture and is able to quickly adapt to new tasks. A task similarity measure is used to determine which transfer architecture is selected, as transfer architectures found on similar tasks will likely perform better. Additionally, we employ a simple meta-transfer architecture that was learned over multiple tasks. Experiments show that warm-started DARTS is able to find competitive performing architectures while reducing searching costs on average by 60%.
    How to Combine Membership-Inference Attacks on Multiple Updated Models. (arXiv:2205.06369v1 [cs.LG])
    A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time, on data that often shifts in distribution, giving the attacker more information. This paper proposes new attacks that take advantage of one or more model updates to improve MI. A key part of our approach is to leverage rich information from standalone MI attacks mounted separately against the original and updated models, and to combine this information in specific ways to improve attack effectiveness. We propose a set of combination functions and tuning methods for each, and present both analytical and quantitative justification for various options. Our results on four public datasets show that our attacks are effective at using update information to give the adversary a significant advantage over attacks on standalone models, as well as over a prior MI attack that takes advantage of model updates in a related machine-unlearning setting. We perform the first measurements of the impact of distribution shift on MI attacks with model updates, and show that a more drastic distribution shift results in significantly higher MI risk than a gradual shift. Our code is available at https://www.github.com/stanleykywu/model-updates.
    Using Natural Sentences for Understanding Biases in Language Models. (arXiv:2205.06303v1 [cs.CL])
    Evaluation of biases in language models is often limited to synthetically generated datasets. This dependence traces back to the need for a prompt-style dataset to trigger specific behaviors of language models. In this paper, we address this gap by creating a prompt dataset with respect to occupations collected from real-world natural sentences present in Wikipedia. We aim to understand the differences between using template-based prompts and natural sentence prompts when studying gender-occupation biases in language models. We find bias evaluations are very sensitive to the design choices of template prompts, and we propose using natural sentence prompts for systematic evaluations to step away from design choices that could introduce bias in the observations.
    Adaptive Block Floating-Point for Analog Deep Learning Hardware. (arXiv:2205.06287v1 [cs.LG])
    Analog mixed-signal (AMS) devices promise faster, more energy-efficient deep neural network (DNN) inference than their digital counterparts. However, recent studies show that DNNs on AMS devices with fixed-point numbers can incur an accuracy penalty because of precision loss. To mitigate this penalty, we present a novel AMS-compatible adaptive block floating-point (ABFP) number representation. We also introduce amplification (or gain) as a method for increasing the accuracy of the number representation without increasing the bit precision of the output. We evaluate the effectiveness of ABFP on the DNNs in the MLPerf datacenter inference benchmark -- realizing less than $1\%$ loss in accuracy compared to FLOAT32. We also propose a novel method of finetuning for AMS devices, Differential Noise Finetuning (DNF), which samples device noise to speed up finetuning compared to conventional Quantization-Aware Training.
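    Block floating-point itself, one shared exponent per block of values with short fixed-point mantissas, can be sketched as follows (ABFP's per-block adaptation and the gain/amplification mechanism are omitted; block size and mantissa width here are illustrative):

    ```python
    # Hypothetical sketch of block floating-point quantization.
    import numpy as np

    def block_float_quantize(x, block=16, mantissa_bits=8):
        out = np.empty_like(x, dtype=np.float64)
        for i in range(0, len(x), block):
            blk = x[i:i + block]
            # one exponent shared by the whole block
            shared_exp = np.floor(np.log2(np.max(np.abs(blk)) + 1e-30))
            # quantization step implied by the mantissa width
            scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
            out[i:i + block] = np.round(blk / scale) * scale
        return out

    x = np.random.randn(64)
    xq = block_float_quantize(x)   # small values in a block lose precision
    ```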
    $\alpha$-GAN: Convergence and Estimation Guarantees. (arXiv:2205.06393v1 [cs.LG])
    We prove a two-way correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences. We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs (Hellinger, vanilla, Total Variation) and corresponds to the minimization of the Arimoto divergence. We show that the Arimoto divergences induced by $\alpha$-GAN equivalently converge, for all $\alpha\in \mathbb{R}_{>0}\cup\{\infty\}$. However, under restricted learning models and finite samples, we provide estimation bounds which indicate diverse GAN behavior as a function of $\alpha$. Finally, we present empirical results on a toy dataset that highlight the practical utility of tuning the $\alpha$ hyperparameter.
    Integrating User and Item Reviews in Deep Cooperative Neural Networks for Movie Recommendation. (arXiv:2205.06296v1 [cs.IR])
    User reviews contain a significant quantity of information across online platforms. This information source has been neglected by most existing recommendation systems, despite its potential to ease the sparsity issue and enhance the quality of suggestions. This work presents a deep model for concurrently learning item attributes and user behaviour from review text. The suggested model, Deep Cooperative Neural Networks (DeepCoNN), consists of two parallel neural networks connected in their final layers. One of the networks focuses on learning user behaviour from reviews submitted by the user, while the other network learns item attributes from user reviews. On top, a shared layer is added to connect these two networks. Similar to factorization machine approaches, the shared layer allows the latent factors acquired for users and items to interact with each other. Experimental findings show that DeepCoNN surpasses all baseline recommendation systems on a number of datasets.
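    A compact sketch of the two-tower layout with a shared interaction layer (dimensions, pooling, and the dot-product interaction are illustrative simplifications of the factorization-machine-style layer):

    ```python
    # Hypothetical sketch of the DeepCoNN layout: one CNN tower encodes the
    # user's review text, the other the item's, and a shared layer lets the
    # two latent vectors interact.
    import torch
    import torch.nn as nn

    class Tower(nn.Module):
        def __init__(self, vocab=5000, emb=64, out=32):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)
            self.conv = nn.Conv1d(emb, out, kernel_size=3, padding=1)

        def forward(self, tokens):                    # tokens: (B, T)
            h = self.emb(tokens).transpose(1, 2)      # (B, emb, T)
            h = torch.relu(self.conv(h))
            return h.max(dim=2).values                # max-pool over time: (B, out)

    user_tower, item_tower = Tower(), Tower()
    u = user_tower(torch.randint(0, 5000, (8, 100)))  # user's concatenated reviews
    v = item_tower(torch.randint(0, 5000, (8, 100)))  # item's concatenated reviews
    rating_pred = (u * v).sum(dim=1)                  # shared interaction layer
    ```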
    Collaborative Multi-agent Stochastic Linear Bandits. (arXiv:2205.06331v1 [cs.LG])
    We study a collaborative multi-agent stochastic linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret. In this setting, each agent has its own linear bandit problem (its own reward parameter) and the goal is to select the best global action w.r.t. the average of their reward parameters. At each round, each agent proposes an action, and one action is randomly selected and played as the network action. All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents. We propose a distributed upper confidence bound (UCB) algorithm and prove a high probability bound on its $T$-round regret in which we include a linear growth of regret associated with each communication round. Our regret bound is of order $\mathcal{O}\Big(\sqrt{\frac{T}{N \log(1/|\lambda_2|)}}\cdot (\log T)^2\Big)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix.
    Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations. (arXiv:2205.06333v1 [cs.RO])
    Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task that are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object-agnostic, and we demonstrate that the resulting representations are insufficient for general-purpose robotics tasks as they fail to capture the complexity of scenes with many components. In this paper, we explore the effectiveness of using object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20 percent increase in performance in low data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines for the task of object localization in multi-object scenes.
    Multi-Environment Meta-Learning in Stochastic Linear Bandits. (arXiv:2205.06326v1 [cs.LG])
    In this work we investigate meta-learning (or learning-to-learn) approaches in multi-task linear stochastic bandit problems that can originate from multiple environments. Inspired by the work of [1] on meta-learning in a sequence of linear bandit problems whose parameters are sampled from a single distribution (i.e., a single environment), here we consider the feasibility of meta-learning when task parameters are drawn from a mixture distribution instead. For this problem, we propose a regularized version of the OFUL algorithm that, when trained on tasks with labeled environments, achieves low regret on a new task without requiring knowledge of the environment from which the new task originates. Specifically, our regret bound for the new algorithm captures the effect of environment misclassification and highlights the benefits over learning each task separately or meta-learning without recognition of the distinct mixture components.  ( 2 min )
    Detailed Balanced Chemical Reaction Networks as Generalized Boltzmann Machines. (arXiv:2205.06313v1 [q-bio.MN])
    Can a micron-sized sack of interacting molecules understand and adapt to a constantly fluctuating environment? Cellular life provides an existence proof in the affirmative, but the principles that allow for life's existence are far from being proven. One challenge in engineering and understanding biochemical computation is the intrinsic noise due to chemical fluctuations. In this paper, we draw insights from machine learning theory, chemical reaction network theory, and statistical physics to show that the broad and biologically relevant class of detailed balanced chemical reaction networks is capable of representing and conditioning complex distributions. These results illustrate how a biochemical computer can use intrinsic chemical noise to perform complex computations. Furthermore, we use our explicit physical model to derive thermodynamic costs of inference.  ( 2 min )
    Design and Implementation of a Quantum Kernel for Natural Language Processing. (arXiv:2205.06409v1 [cs.CL])
    Natural language processing (NLP) is the field that attempts to make human language accessible to computers, and it relies on applying a mathematical model to express the meaning of symbolic language. One such model, DisCoCat, defines how to express both the meaning of individual words as well as their compositional nature. This model can be naturally implemented on quantum computers, leading to the field of quantum NLP (QNLP). Recent experimental work used quantum machine learning techniques to map from text to class label using the expectation value of the quantum encoded sentence. Theoretical work has been done on computing the similarity of sentences but relies on an unrealized quantum memory store. The main goal of this thesis is to leverage the DisCoCat model to design a quantum-based kernel function that can be used by a support vector machine (SVM) for NLP tasks. Two similarity measures were studied: (i) the transition amplitude approach and (ii) the SWAP test. A simple NLP meaning classification task from previous work was used to train the word embeddings and evaluate the performance of both models. The Python module lambeq and its related software stack were used for implementation. The explicit model from previous work was used to train word embeddings and achieved a testing accuracy of $93.09 \pm 0.01$%. It was shown that both SVM variants achieved higher testing accuracies of $95.72 \pm 0.01$% for approach (i) and $97.14 \pm 0.01$% for (ii). The SWAP test was then simulated under a noise model defined by the real quantum device, ibmq_guadalupe. The explicit model achieved an accuracy of $91.94 \pm 0.01$% while the SWAP test SVM achieved 96.7% on the testing dataset, suggesting that the kernelized classifiers are resilient to noise. These are encouraging results and motivate further investigations of our proposed kernelized QNLP paradigm.  ( 2 min )
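    The thesis code itself is not reproduced here, but the kernel construction at its core is easy to sketch classically: the SWAP test estimates the state overlap $|\langle\psi(x)|\psi(y)\rangle|^2$, which can be fed to an SVM as a precomputed kernel. A minimal numpy stand-in (random statevectors replace the DisCoCat sentence encodings, which in the thesis come from lambeq):

    ```python
    import numpy as np
    from sklearn.svm import SVC

    def fidelity_kernel(A, B):
        # Pairwise |<psi_a|psi_b>|^2 between rows of normalized statevectors;
        # this is the quantity a SWAP test estimates on hardware.
        return np.abs(A.conj() @ B.T) ** 2

    rng = np.random.default_rng(0)
    def random_states(n, dim=8):
        # Stand-ins for quantum sentence encodings (toy data, not lambeq output).
        v = rng.normal(size=(n, dim)) + 1j * rng.normal(size=(n, dim))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    X_train, y_train = random_states(40), rng.integers(0, 2, 40)
    X_test = random_states(10)

    clf = SVC(kernel="precomputed")
    clf.fit(fidelity_kernel(X_train, X_train), y_train)
    preds = clf.predict(fidelity_kernel(X_test, X_train))
    ```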
    Improving Sequential Query Recommendation with Immediate User Feedback. (arXiv:2205.06297v1 [cs.IR])
    We propose an algorithm for next query recommendation in interactive data exploration settings, like knowledge discovery for information gathering. The state-of-the-art query recommendation algorithms are based on sequence-to-sequence learning approaches that exploit historical interaction data. We propose to augment the transformer-based causal language models for query recommendations to adapt to the immediate user feedback using a multi-armed bandit (MAB) framework. We conduct a large-scale experimental study using log files from a popular online literature discovery service and demonstrate that our algorithm improves the cumulative regret substantially with respect to the state-of-the-art transformer-based query recommendation models, which do not make use of the immediate user feedback. Our data model and source code are available at \url{https://anonymous.4open.science/r/exp3_ss-9985/}.  ( 2 min )
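    The repository name suggests an EXP3-style bandit; the paper's exact coupling with the transformer is not given in the abstract, but a generic EXP3 layer over a slate of candidate queries (a sketch under that assumption, with a toy click model) looks like:

    ```python
    import numpy as np

    def exp3(n_arms, T, reward_fn, gamma=0.1, seed=0):
        # EXP3: maintain weights over candidate queries and update them with
        # importance-weighted bandit feedback (e.g. click / no click).
        w = np.ones(n_arms)
        rng = np.random.default_rng(seed)
        for _ in range(T):
            p = (1 - gamma) * w / w.sum() + gamma / n_arms
            arm = rng.choice(n_arms, p=p)
            r = reward_fn(arm)                               # reward in [0, 1]
            w[arm] *= np.exp(gamma * r / (n_arms * p[arm]))  # importance weighting
        return w / w.sum()

    # Toy feedback: query 2 is clicked 70% of the time, the rest 20%.
    fb_rng = np.random.default_rng(1)
    probs = exp3(5, 2000, lambda a: float(fb_rng.random() < (0.7 if a == 2 else 0.2)))
    ```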
    Modularity in NEAT Reinforcement Learning Networks. (arXiv:2205.06451v1 [cs.NE])
    Modularity is essential to many well-performing structured systems, as it is a useful means of managing complexity [8]. An analysis of modularity in neural networks produced by machine learning algorithms can offer valuable insight into the workings of such algorithms and how modularity can be leveraged to improve performance. However, this property is often overlooked in the neuroevolutionary literature, so the modular nature of many learning algorithms is unknown. This property was assessed on the popular algorithm "NeuroEvolution of Augmenting Topologies" (NEAT) for standard simulation benchmark control problems due to NEAT's ability to optimise network topology. This paper shows that NEAT networks seem to rapidly increase in modularity over time with the rate and convergence dependent on the problem. Interestingly, NEAT tends towards increasingly modular networks even when network fitness converges. It was shown that the ideal level of network modularity in the explored parameter space is highly dependent on other network variables, dispelling theories that modularity has a straightforward relationship to network performance. This is further proven in this paper by demonstrating that rewarding modularity directly did not improve fitness.  ( 2 min )
  • Open

    On the Existence of Simpler Machine Learning Models. (arXiv:1908.01755v4 [cs.LG] UPDATED)
    It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.
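    A back-of-the-envelope version of the ratio can be estimated by Monte Carlo: sample hypotheses uniformly from the function class, score each, and report the fraction landing within $\epsilon$ of the best. A toy sketch for unit-norm linear classifiers (illustrative only; the paper's volume-based definition is more careful):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([1.0, -0.5]) > 0).astype(int)

    # Hypothesis space: unit-norm linear classifiers, sampled uniformly.
    W = rng.normal(size=(20000, 2))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    preds = (X @ W.T > 0).astype(int)            # (n_points, n_models)
    accs = (preds == y[:, None]).mean(axis=0)    # accuracy of each sampled model

    eps = 0.02                                    # Rashomon tolerance
    print("estimated Rashomon ratio:", (accs >= accs.max() - eps).mean())
    ```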
    An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices. (arXiv:2205.06308v1 [math.PR])
    We consider random matrices whose entries are obtained by applying a (nonlinear) kernel function to the pairwise inner products between $n$ independent data vectors drawn uniformly from the unit sphere in $\mathbb{R}^d$. Our study of this model is motivated by problems in machine learning, statistics, and signal processing, where such inner-product kernel random matrices and their spectral properties play important roles. Under mild conditions on the kernel function, we establish the weak-limit of the empirical spectral distribution of these matrices when $d, n \to \infty$ such that $n / d^\ell \to \kappa \in (0, \infty)$, for some fixed $\ell \in \mathbb{N}$ and $\kappa \in \mathbb{R}$. This generalizes an earlier result of Cheng and Singer (2013), who studied the same model in the linear scaling regime (with $\ell = 1$ and $n/d \to \kappa$). The main insight of our work is a general equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as the linear combination of a (shifted) Wishart matrix and an independent matrix drawn from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by $\ell$ and by the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law.  ( 2 min )
    Heavy-Tail Phenomenon in Decentralized SGD. (arXiv:2205.06689v1 [stat.ML])
    Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to `multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
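    The multiplicative-noise mechanism behind such results is visible in one dimension (a toy sketch of centralized SGD only; the decentralized analysis is beyond a few lines), together with a Hill estimator of the tail index:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    eta, iters = 0.6, 100_000                # deliberately large step-size
    theta, samples = 0.0, []
    for t in range(iters):
        x, y = rng.normal(), rng.normal()    # linear regression, true parameter 0
        theta -= eta * x * (theta * x - y)   # i.e. theta*(1 - eta*x^2) + eta*x*y
        if t > iters // 2:                   # discard burn-in
            samples.append(abs(theta))

    # Hill estimator over the top-k order statistics; smaller = heavier tails.
    s = np.sort(samples)[::-1]
    k = 500
    print("estimated tail index:", 1.0 / np.mean(np.log(s[:k] / s[k])))
    ```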
    Probabilistic Estimation of Chirp Instantaneous Frequency Using Gaussian Processes. (arXiv:2205.06306v1 [stat.ML])
    We present a probabilistic approach for estimating a chirp signal and its instantaneous frequency function when the true forms of the chirp and instantaneous frequency are unknown. To do so, we represent them by joint cascading Gaussian processes governed by a non-linear stochastic differential equation, and estimate their posterior distribution by using stochastic filters and smoothers. The model parameters are determined via maximum likelihood estimation. Theoretical results show that the estimation method has a bounded mean squared error. Experiments show that the method outperforms a number of baseline methods on a synthetic model, and we also apply the method to analyse gravitational wave data.  ( 2 min )
    Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. (arXiv:2205.06811v1 [cs.LG])
    We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
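    The weighting idea is easy to sketch in isolation (schematic only; the paper's exact weighting rule and thresholds differ):

    ```python
    import numpy as np

    def weighted_ridge(X, r, w, lam=1.0):
        # Weighted ridge regression: rounds with small weight contribute less
        # to the estimate, capping the damage any corrupted round can do.
        A = lam * np.eye(X.shape[1]) + (X * w[:, None]).T @ X
        b = (X * w[:, None]).T @ r
        return np.linalg.solve(A, b)

    def confidence_weight(x, A_inv, threshold=1.0):
        # Full weight while the exploration bonus ||x||_{A^{-1}} is small,
        # shrunk once it exceeds the threshold.
        bonus = np.sqrt(x @ A_inv @ x)
        return min(1.0, threshold / bonus)
    ```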
    Fast Conditional Network Compression Using Bayesian HyperNetworks. (arXiv:2205.06404v1 [cs.LG])
    We introduce a conditional compression problem and propose a fast framework for tackling it. The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts, e.g. a context involving only a subset of classes or a context where only limited compute resource is available. To solve this, we propose an efficient Bayesian framework to compress a given large network into much smaller size tailored to meet each contextual requirement. We employ a hypernetwork to parameterize the posterior distribution of weights given conditional inputs and minimize a variational objective of this Bayesian neural network. To further reduce the network sizes, we propose a new input-output group sparsity factorization of weights to encourage more sparseness in the generated weights. Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.  ( 2 min )
    Weak consistency of the 1-nearest neighbor measure with applications to missing data. (arXiv:1902.02408v3 [math.ST] UPDATED)
    When data is partially missing at random, imputation and importance weighting are often used to estimate moments of the unobserved population. In this paper, we study 1-nearest neighbor (1NN) importance weighting, which estimates moments by replacing missing data with the complete data that is the nearest neighbor in the non-missing covariate space. We define an empirical measure, the 1NN measure, and show that it is weakly consistent for the measure of the missing data. The main idea behind this result is that the 1NN measure is performing inverse probability weighting in the limit. We study applications to missing data and mitigating the impact of covariate shift in prediction tasks.
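    The estimator is a one-liner on top of a KD-tree, and its bias-correcting effect shows up even in a toy missing-at-random setup:

    ```python
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.normal(size=(n, 1))                  # always-observed covariate
    y = 2 * x[:, 0] + rng.normal(size=n)         # outcome with true mean 0
    observed = rng.random(n) < 1 / (1 + np.exp(-x[:, 0]))   # MAR missingness

    # 1NN: replace each missing y by the y of the nearest complete case
    # in covariate space, then take a plain mean over all points.
    _, idx = cKDTree(x[observed]).query(x[~observed])
    y_hat = y.copy()
    y_hat[~observed] = y[observed][idx]

    print("complete-case mean:", y[observed].mean())  # biased upward
    print("1NN-imputed mean:  ", y_hat.mean())        # close to the true 0
    ```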
    Robust and Heterogenous Odds Ratio: Estimating Price Sensitivity for Unbought Items. (arXiv:2106.11389v2 [stat.ME] UPDATED)
    Problem definition: Mining for heterogeneous responses to an intervention is a crucial step for data-driven operations, for instance to personalize treatment or pricing. We investigate how to estimate price sensitivity from transaction-level data. In causal inference terms, we estimate heterogeneous treatment effects when (a) the response to treatment (here, whether a customer buys a product) is binary, and (b) treatment assignments are partially observed (here, full information is only available for purchased items). Methodology/Results: We propose a recursive partitioning procedure to estimate heterogeneous odds ratio, a widely used measure of treatment effect in medicine and social sciences. We integrate an adversarial imputation step to allow for robust estimation even in presence of partially observed treatment assignments. We validate our methodology on synthetic data and apply it to three case studies from political science, medicine, and revenue management. Managerial Implications: Our robust heterogeneous odds ratio estimation method is a simple and intuitive tool to quantify heterogeneity in patients or customers and personalize interventions, while lifting a central limitation in many revenue management datasets.
    secml: A Python Library for Secure and Explainable Machine Learning. (arXiv:1912.10013v2 [cs.LG] UPDATED)
    We present \texttt{secml}, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, \texttt{secml} provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. \texttt{secml} also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at \url{https://github.com/pralab/secml}.
    The interplay between ranking and communities in networks. (arXiv:2112.12670v2 [cs.SI] UPDATED)
    Community detection and hierarchy extraction are usually thought of as separate inference tasks on networks. Considering only one of the two when studying real-world data can be an oversimplification. In this work, we present a generative model based on an interplay between community and hierarchical structures. It assumes that each node has a preference in the interaction mechanism and nodes with the same preference are more likely to interact, while heterogeneous interactions are still allowed. The sparsity of the network is exploited for implementing a more efficient algorithm. We demonstrate our method on synthetic and real-world data and compare performance with two standard approaches for community detection and ranking extraction. We find that the algorithm accurately retrieves the overall node's preference in different scenarios, and we show that it can distinguish small subsets of nodes that behave differently than the majority. As a consequence, the model can recognize whether a network has an overall preferred interaction mechanism. This is relevant in situations where there is no clear "a priori" information about what structure explains the observed network datasets well. Our model allows practitioners to learn this automatically from the data.
    Anomaly Detection using Principles of Human Perception. (arXiv:2103.12323v4 [cs.CR] UPDATED)
    In the fields of statistics and unsupervised machine learning a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random variable modelling, anomalies are directly found in a set of data by a uniform and independent random assumption of the distribution of constituent elements of the observations, with anomalies corresponding to those observations where the expectation of the number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception, an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data with promising results on the detection of global anomalies in multivariate data.
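    The "expected occurrences $<1$" rule is concrete enough for a few lines of code. A toy sketch on strings (not the paper's algorithm): a substring that repeats although, under a uniform-and-independent null, not even one occurrence was expected is flagged as meaningful.

    ```python
    import random
    from collections import Counter

    def helmholtz_substring_anomalies(strings, m=5, alphabet_size=4):
        # Expected occurrences of any fixed length-m substring under the null
        # is (#positions) * alphabet_size**-m; repeats of a substring whose
        # expectation is < 1 are flagged, in the spirit of the Helmholtz principle.
        counts, positions = Counter(), 0
        for s in strings:
            positions += max(0, len(s) - m + 1)
            for i in range(len(s) - m + 1):
                counts[s[i:i + m]] += 1
        expected = positions * alphabet_size ** (-m)
        return {sub: c for sub, c in counts.items() if expected < 1 and c >= 2}

    random.seed(0)
    data = ["".join(random.choice("ACGT") for _ in range(12)) for _ in range(5)]
    data += ["GGGGG" + "".join(random.choice("ACGT") for _ in range(7)) for _ in range(3)]
    print(helmholtz_substring_anomalies(data))   # flags the planted GGGGG motif
    ```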
    $\alpha$-GAN: Convergence and Estimation Guarantees. (arXiv:2205.06393v1 [cs.LG])
    We prove a two-way correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences. We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs (Hellinger, vanilla, Total Variation) and corresponds to the minimization of the Arimoto divergence. We show that the Arimoto divergences induced by $\alpha$-GAN equivalently converge, for all $\alpha\in \mathbb{R}_{>0}\cup\{\infty\}$. However, under restricted learning models and finite samples, we provide estimation bounds which indicate diverse GAN behavior as a function of $\alpha$. Finally, we present empirical results on a toy dataset that highlight the practical utility of tuning the $\alpha$ hyperparameter.
    Differentiable Graph Module (DGM) for Graph Convolutional Networks. (arXiv:2002.04999v4 [cs.LG] UPDATED)
    Graph deep learning has recently emerged as a powerful ML concept that allows successful deep neural architectures to be generalized to non-Euclidean structured data. Such methods have shown promising results on a broad spectrum of applications ranging from social science, biomedicine, and particle physics to computer vision, graphics, and chemistry. One of the limitations of the majority of current graph neural network architectures is that they are often restricted to the transductive setting and rely on the assumption that the underlying graph is {\em known} and {\em fixed}. Often, this assumption is not true since the graph may be noisy, or partially and even completely unknown. In such cases, it would be helpful to infer the graph directly from the data, especially in inductive settings where some nodes were not present in the graph at training time. Furthermore, learning a graph may become an end in itself, as the inferred structure may provide complementary insights next to the downstream task. In this paper, we introduce Differentiable Graph Module (DGM), a learnable function that predicts edge probabilities in the graph which are optimal for the downstream task. DGM can be combined with convolutional graph neural network layers and trained in an end-to-end fashion. We provide an extensive evaluation of applications from the domains of healthcare (disease prediction), brain imaging (age prediction), computer graphics (3D point cloud segmentation), and computer vision (zero-shot learning). We show that our model provides a significant improvement over baselines both in transductive and inductive settings and achieves state-of-the-art results.
    Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets. (arXiv:2205.06595v1 [stat.ML])
    Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
    Variational Hyper-Encoding Networks. (arXiv:2005.08482v2 [stat.ML] UPDATED)
    We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters $\theta$ are drawn from a distribution $p(\theta)$ which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters $\theta$ into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution $q(\theta)$. HyperVAE can encode the parameters $\theta$ in full, in contrast to common hyper-network practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
    Multiple Domain Causal Networks. (arXiv:2205.06791v1 [stat.ML])
    Observational studies are regarded as economic alternatives to randomized trials, often used in their stead to investigate and determine treatment efficacy. Due to lack of sample size, observational studies commonly combine data from multiple sources or different sites/centers. Despite the benefits of an increased sample size, a naive combination of multicenter data may result in incongruities stemming from center-specific protocols for generating cohorts or reactions towards treatments distinct to a given center, among other things. These issues arise in a variety of other contexts, including capturing a treatment effect related to an individual's unique biological characteristics. Existing methods for estimating heterogeneous treatment effects have not adequately addressed the multicenter context, but rather treat it simply as a means to obtain sufficient sample size. Additionally, previous approaches to estimating treatment effects do not straightforwardly generalize to the multicenter design, especially when required to provide treatment insights for patients from a new, unobserved center. To address these shortcomings, we propose Multiple Domain Causal Networks (MDCN), an approach that simultaneously strengthens the information sharing between similar centers while addressing the selection bias in treatment assignment through learning of a new feature embedding. In empirical evaluations, MDCN is consistently more accurate when estimating the heterogeneous treatment effect in new centers compared to benchmarks that adjust solely based on treatment imbalance or general center differences. Finally, we justify our approach by providing theoretical analyses that demonstrate that MDCN improves on the generalization bound of the new, unobserved target center.
    Clustering with missing data: which equivalent for Rubin's rules?. (arXiv:2011.13694v2 [stat.ME] UPDATED)
    Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way to apply clustering after MI remains unclear: how should partitions be pooled? How should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is addressed using consensus clustering while, based on bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are theoretically argued and extensively studied by simulation. Partition pooling improves accuracy, while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way to choose the number of clusters when data are incomplete, as illustrated on a real data set.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v1 [stat.ML])
    The rapid progress of machine learning interatomic potentials over the past couple of years has produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state-of-the-art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v1 [stat.ML])
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
    Explaining by Removing: A Unified Framework for Model Explanation. (arXiv:2011.14878v2 [cs.LG] UPDATED)
    Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
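    One corner of the framework fits in a few lines: remove a feature by marginalizing it over a background sample and summarize its influence by the shift it causes in the prediction (essentially the permutation-test cell of the paper's taxonomy; SHAP and LIME correspond to other choices along the three dimensions). A sketch:

    ```python
    import numpy as np

    def removal_attributions(model, x, background, n_samples=256, seed=0):
        # model: vectorized callable mapping an (n, d) array to n predictions.
        rng = np.random.default_rng(seed)
        base = model(x[None, :])[0]
        attr = np.zeros(len(x))
        for j in range(len(x)):
            xs = np.tile(x, (n_samples, 1))
            xs[:, j] = rng.choice(background[:, j], size=n_samples)  # "remove" j
            attr[j] = base - model(xs).mean()
        return attr

    # Usage with any vectorized predictor, e.g. model = lambda X: X @ w.
    ```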
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v8 [cs.LG] UPDATED)
    In this paper, we propose a novel neural exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of work explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB are adapted for exploration. Instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn potential gains compared to the currently estimated reward for exploration. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.
    Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees. (arXiv:2203.04341v2 [stat.ME] UPDATED)
    A new Bayesian modelling framework is introduced for piece-wise homogeneous variable-memory Markov chains, along with a collection of effective algorithmic tools for change-point detection and segmentation of discrete time series. Building on the recently introduced Bayesian Context Trees (BCT) framework, the distributions of different segments in a discrete time series are described as variable-memory Markov chains. Inference for the presence and location of change-points is then performed via Markov chain Monte Carlo sampling. The key observation that facilitates effective sampling is that, using one of the BCT algorithms, the prior predictive likelihood of the data can be computed exactly, integrating out all the models and parameters in each segment. This makes it possible to sample directly from the posterior distribution of the number and location of the change-points, leading to accurate estimates and providing a natural quantitative measure of uncertainty in the results. Estimates of the actual model in each segment can also be obtained, at essentially no additional computational cost. Results on both simulated and real-world data indicate that the proposed methodology performs better than or as well as state-of-the-art techniques.

  • Open

    [D] Multi-modal information sharing strategies when one "modality" is tabular data.
    I'm hunting for literature on architectures for multimodal autoencoders when one "modality" is tabular data. Keyword searches aren't getting me what I want though. I have a mixture of sequential, audio, image, and tabular data. For the non-tabular modalities it seems like cross attention in the encoder/decoder is the obvious choice, but I'm not sure how to best incorporate the tabular component. Options seem like 1) use a standard ffn on the tabular data and connect it to the shared latent space. But this seems not ideal if there are feature interactions between tabular and other modalities. Or, 2) put tabular data through a ffn and repeat and concatenate it to the input of each other modality. Then add some pooling layers in the decoder then a fc layer for reconstruction of the tabular data. This seems inefficient because you replicate the tabular data across modalities. Is there anything as elegant as cross attention but for cross modal fusion with tabular data? submitted by /u/WigglyHypersurface [link] [comments]  ( 1 min )
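    One option between the two described above (a sketch, not established best practice): project the tabular vector into a handful of learned "tokens" so it can participate in ordinary cross attention alongside the other modalities, without replicating it per modality.

    ```python
    import torch
    import torch.nn as nn

    class TabularTokens(nn.Module):
        # Project tabular features into n_tokens tokens of width d_model.
        def __init__(self, n_features, d_model, n_tokens=4):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(n_features, d_model * n_tokens),
                                      nn.GELU())
            self.n_tokens, self.d_model = n_tokens, d_model

        def forward(self, tab):                       # (B, n_features)
            t = self.proj(tab)                        # (B, n_tokens * d_model)
            return t.view(-1, self.n_tokens, self.d_model)

    d = 64
    tab_tokens = TabularTokens(n_features=10, d_model=d)(torch.randn(8, 10))
    img_tokens = torch.randn(8, 196, d)               # e.g. ViT patch tokens
    attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    fused, _ = attn(query=img_tokens, key=tab_tokens, value=tab_tokens)
    ```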
    [D] Any primer on all the major image generators and how they compare to each other?
    It seems every other week another image generator comes out. Mostly countless Dalle versions at this point. ru-Dalle, mini-Dalle, dalle flow, dalle mega. Other than that the original Dalle and dalle 2 are from Open AI and presumably are better than all the others, I know very little about them. Is there any consolidated primer on what are the major/most promising image generators, what makes them unique, what source/interface are available for them and how they stack up against each other, parameters etc? Doesn't need to be superdetailed. Just a short orientation. submitted by /u/cyborgsnowflake [link] [comments]  ( 1 min )
    [D] Is the standard path of a researcher not effective anymore?
    Hello All! I am currently a master's student. I want to work as an AI researcher, so the typical path I know of is pursuing a Ph.D. and then trying to join a respected research lab. However, I have stumbled upon a couple of discussions in blog posts, Twitter, and in this subreddit suggesting that the current progress of those labs is more engineering than science; most current research papers, including ones at top conferences, aim at incremental improvement of current methodologies instead of breakthroughs, and without any understanding of why these models work. A reason for this appears to be that research in deep learning is mainly influenced by giant tech companies that favor short-term progress over long-term progress. Another appears to be the broken incentive structures in research in general, whose bad influence has been amplified in this field, leading researchers to become paper producers rather than progress producers. So, am I missing the big picture? Are there other paths? Are the incentive structures so broken that someone could make better progress by doing research on their own or by joining open labs like ML Collective? submitted by /u/TryingToGeek [link] [comments]  ( 2 min )
    [R] Locating and Editing Factual Knowledge in GPT
    https://i.redd.it/rktep95sdpz81.gif Here's an interesting paper from MIT CSAIL on a method to edit knowledge within language models — sets up a nice framework for knowledge editing and representation, and a reference dataset to test facts within a model. Also has full code & colab notebooks to edit your own facts within GPT: https://rome.baulab.info/ + https://github.com/kmeng01/rome + https://colab.research.google.com/github/kmeng01/rome/blob/main/notebooks/rome.ipynb. Abstract: We investigate the mechanisms underlying factual knowledge recall in autoregressive transformer language models. First, we develop a causal intervention for identifying neuron activations capable of altering a model’s factual predictions. Within large GPT-style models, this reveals two distinct sets of neurons that we hypothesize correspond to knowing an abstract fact and saying a concrete word, respectively. This insight inspires the development of ROME, a novel method for editing facts stored in model weights. For evaluation, we assemble COUNTERFACT, a dataset of over twenty thousand counterfactuals and tools to facilitate sensitive measurements of knowledge editing. Using COUNTERFACT, we confirm the distinction between saying and knowing neurons, and we find that ROME achieves state-of-the-art performance in knowledge editing compared to other methods. An interactive demo notebook, full code implementation, and the dataset are available at https://rome.baulab.info/. submitted by /u/Quantum_Network [link] [comments]  ( 1 min )
    [D] A Brief History & The Current State of Anime Generating AI
    submitted by /u/farfromhome2020 [link] [comments]
    [N] Is Deepmind's "Gato" a precursor for general artificial intelligence? According to Gary Marcus, most certainly not.
    submitted by /u/much_successes [link] [comments]  ( 2 min )
    [D] If random search works so well, why is there no published paper that compares it to RL?
    This is not a sarcastic question! I'm reading the excellent series An Outsider's Tour of Reinforcement Learning and was very interested to find out that simple random search is actually a very good algorithm on a variety of benchmarks. If simple random search works so well and, being fast, allows us to evaluate performance more thoroughly, why isn't there a more developed line of research exploring its application to optimal control? Why is there no paper saying "hey RL community, this is the new state of the art"? submitted by /u/GorillaWithGroove [link] [comments]  ( 2 min )
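    For reference, the kind of random search that series discusses fits in a dozen lines (an augmented-random-search-style sketch; `env_return` is a hypothetical function returning the episode return of a rollout under the linear policy a = M @ s):

    ```python
    import numpy as np

    def basic_random_search(env_return, d_state, d_action,
                            n_iters=100, n_dirs=8, step=0.02, noise=0.03):
        # Perturb a linear policy in random directions, roll out both signs,
        # and move along the reward-weighted average of the directions.
        M = np.zeros((d_action, d_state))
        rng = np.random.default_rng(0)
        for _ in range(n_iters):
            deltas = rng.normal(size=(n_dirs, d_action, d_state))
            rewards = [(env_return(M + noise * d_), env_return(M - noise * d_))
                       for d_ in deltas]
            grad = sum((rp - rm) * d_ for (rp, rm), d_ in zip(rewards, deltas))
            M += step * grad / n_dirs
        return M
    ```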
    [P] DALL·E Mega Website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    [D] Can I re-publish a Medium post later as an arXiv paper?
    Does anyone know about legal implications of either platform? Has anybody done a similar thing, e.g. for better citation? Thanks submitted by /u/ihatebeinganonymous [link] [comments]  ( 1 min )
    [R] Symphony Generation with Permutation Invariant Language Model
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] What is the best practice for adding new samples to the training set of a model once a new sample is discovered/acquired?
    Let’s say I have a trained model for which I used a training set T. If new samples become available, what is the best practice for using this new information to improve the model? My thoughts are: 1) create a new training set T’ that contains the new samples and retrain the model, or 2) fine-tune the trained model by training for a few epochs on only the new samples. Furthermore, if the newly acquired samples are more important to the task than the ones in the original T, is there a way I can bias the training towards the new samples? submitted by /u/StOchastiC_ [link] [comments]  ( 2 min )
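    For option (1) with a bias toward the new samples, oversampling is a simple lever (a PyTorch sketch with toy stand-in datasets; the 3x weight is arbitrary):

    ```python
    import torch
    from torch.utils.data import (TensorDataset, ConcatDataset,
                                  DataLoader, WeightedRandomSampler)

    # Toy stand-ins for the original training set T and the new samples.
    old_dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))
    new_dataset = TensorDataset(torch.randn(50, 8), torch.randint(0, 2, (50,)))

    full = ConcatDataset([old_dataset, new_dataset])
    weights = torch.cat([torch.ones(len(old_dataset)),         # old: weight 1
                         3.0 * torch.ones(len(new_dataset))])  # new: weight 3
    sampler = WeightedRandomSampler(weights, num_samples=len(full),
                                    replacement=True)
    loader = DataLoader(full, batch_size=64, sampler=sampler)
    # Fine-tuning only on the new samples (option 2) risks forgetting T;
    # retraining on the reweighted mixture is a common middle ground.
    ```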
    [R] Are there any publicly available models for non-autoregressive text generation?
    I am building a non-autoregressive text generation model. For evaluation I use the recently proposed MAUVE score. The problem is that all previous works use different metrics, such as BLEU/self-BLEU etc. To compare my model with the previous ones I want to evaluate the MAUVE score on their output. But I couldn't find any downloadable trained models for recent papers such as SUNDAE or ScratchGAN. On huggingface there are plenty of GPT-like models (autoregressive), but I don't know how to find non-autoregressive ones there. submitted by /u/Tomarchelone [link] [comments]  ( 1 min )
    [D] Open Source library to do automatic EDA + experiment tracking in a spreadsheet
    Hi All, We have released a new version of python library, VevestaX. The library does automatic EDA and experiment tracking in a spreadsheet. The library can be downloaded using: pip install vevestaX Following is the link to its demo: https://youtu.be/7jmnIOqBpJM Following is the github link: https://github.com/Vevesta/VevestaX/blob/main/README.md Following is the sample output spreadsheet: https://docs.google.com/spreadsheets/d/15lOXzpcUQtkYQAEnx-YTegvg8zCW6pEK/edit?usp=sharing&ouid=103382336064969333270&rtpof=true&sd=true Please give us a github star, it would mean the world for us. Please mail your feature requests to OP at vevestax@vevesta.com submitted by /u/vevesta [link] [comments]  ( 1 min )
  • Open

    Methods to guarantee stability of actor-critic based learning?
    Hi, coming from a control systems background, is there any mathematical trickery to prove/guarantee/99%-guarantee the stability of a converged policy? I am thinking of something like DDPG to start with? submitted by /u/_Arrietty [link] [comments]  ( 1 min )
    Sampling a probabilistic action space for DQN
    I think it's relatively well understood why we wouldn't sample actions probabilistically during learning: the policy learning is a regression problem, so a linear output (rather than activating the output with ReLU/Softmax) of the feature layers is necessary. However, I want to know whether we can (or even should) sample the action space using softmax for non-learning steps. For example, during training the DQN cycle is: record current observation -> take action -> record next observation -> save transition, (s, a, r, s') + some termination indicator -> learn step (sample batch/mini-batch of transitions to learn from). In this case, what would happen if we took a probability distribution over actions (e.g. softmax on the output of the q-network) during exploration (the non-learning step)? Would this even do anything? (Correct me if I have misspoken anywhere, thank you) submitted by /u/Background-Cable-491 [link] [comments]  ( 1 min )
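    What is being described is Boltzmann (softmax) exploration, a legitimate drop-in for epsilon-greedy during the act/record phase; since DQN's learn step is off-policy (max-based target), it is unchanged. A sketch:

    ```python
    import numpy as np

    def boltzmann_action(q_values, temperature=1.0, rng=None):
        # Sample an action with probability proportional to exp(Q / tau).
        # Only data collection changes; the regression on raw
        # (linear-output) Q-values stays exactly as before.
        rng = rng or np.random.default_rng()
        z = q_values / temperature
        z -= z.max()                          # numerical stability
        p = np.exp(z) / np.exp(z).sum()
        return rng.choice(len(q_values), p=p)

    # During exploration: action = boltzmann_action(q_network(state), 0.5)
    ```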
    Loss doesn't decrease in Deep Q Learning
    I am training a DQL NN to learn to play tic-tac-toe against the optimal player. I am using a Huber Loss. It seems to work since the average reward increases with training, but the loss increases as well, which I was not expecting. Is this a clue that something is wrong or is it normal? submitted by /u/alesaso2000 [link] [comments]  ( 1 min )
    If we have a reward function that works perfectly for this, would adding another action such as steering have a significant effect on the possible optimal policy?
    If we have a working reward function, providing the desired behavior and optimal policy in a continuous action/state-space problem, would adding another action significantly affect the possible optimal policy? for example, assume you have an RL problem with an action space of 1 (de/acceleration), state-space of 2 (distance from position and velocity), and the agent is tasked to accelerate in a straight line from position a to b. do you think the agent would behave majorly differently? - I'm under the assumption that there would be minimal change aside from a longer training time assuming enough exploration, as the task is to still move in a straight line, but the agent would only have to account for steering action too now submitted by /u/philori [link] [comments]  ( 1 min )
    Does anyone know good Python sources with RL hardcoded from scratch?
    As the title says, I would like to build my own RL agent, but without using ready-made libraries such as stable-baselines3, TensorFlow or PyTorch, instead implementing the mathematical formulas by hand. submitted by /u/GarantBM [link] [comments]  ( 1 min )
    Learning policies based on mix of algorithms
    Dear RL community, I am currently playing around with using mixed RL algorithms on the same DRL policy network. This lets me make use of expert MDP trajectories (e.g. BC), offline simulation training (e.g. off-policy SAC), as well as online improvement (e.g. on-policy PPO). While the algorithms differ greatly from each other, the resulting policy network tends to improve across stages, with a bump in performance during stage transitions. Have you tried something like this before? How did it turn out for you in comparison to single-algorithm approaches? submitted by /u/frugaleringenieur [link] [comments]  ( 1 min )
  • Open

    Artificial Intelligence in Social Media Enriches Marketing ROI
    Leveraging Machine Learning Models in Social Media Platforms: What Motivates Social Media Marketers? The intersections of marketing with artificial intelligence (AI) technology have opened up fascinating opportunities for marketers of all stripes. AI has been making steady inroads into social media marketing, and rightly so. As the number of social media users rises…  ( 4 min )
    Why Gato from Deepmind is a game changer
    This week, there was (yet another) game-changing announcement from the folks at Deepmind: a model named Gato. Gato is a cool cat 🙂 It leads us closer on the road to AGI because Gato is a transformer model that can do multiple things like caption images, chat with people, play games, etc. That means you train…  ( 3 min )
  • Open

    Managing Cloud Infrastructure Services With TPI Plugin For Terraform
    Terraform Provider Iterative (TPI), a new tool by Iterative that extends the capabilities of HashiCorp's Terraform and streamlines machine learning engineers' cloud experience with automated cloud infrastructure tools: Iterative Introduces New ML Tool TPI Plugin For HashiCorp's Terraform Cloud Infrastructure Service. TPI makes the Terraform-for-machine-learning experience more consistent and less expensive, with features including spot recovery improvements, automatic cleanup of unused resources, easy switching between cloud service vendors, and more. submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Price recommendation for e-comm | methodology suggestions needed
    Hi everyone, Hope you are having a great weekend. I work for a small e-comm in my country as a data scientist. My first project (tight timeline) is to recommend price increases for products to give a small boost to revenue. I have fixed and curated all the transaction data that I might need but struggle with actually modelling the problem. Current approach I am working towards: 1. Build a log-log model for each product to identify inelastic products. 2. The log-log model is built for each product at daily-level aggregates, with cross elasticities of other products. 3. Feature space for each product model: volume of the product sold in the day as the dependent variable, other products' prices expressed as price indices relative to the product, plus other seasonal features. 4. The definition of inelasticity would be that an increase in price results in an overall increase in revenue (this is because I want to take into account the cross elasticity with other products and the impact of the price change of the parent product on the volume of unchanged products in the same category hierarchy). Challenges I am facing: 1. R2 is relatively okay, but I find it difficult to show some back-testing to the business to gain their confidence, since R2 is a metric they do not understand/trust. 2. When building the model at the daily level, not all products are sold on the same day, so a lot of the cross-product price index features are missing. Any suggestions on improving the methodology would be much appreciated, and in return for saving me from this pinch I would love to offer my time for future collaboration if it helps in any way. submitted by /u/jimmyiceman [link] [comments]  ( 1 min )
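    A minimal version of steps 1-3 on synthetic data (column names are illustrative, not your schema), including the revenue framing a business audience can read directly:

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 365
    price = np.exp(rng.normal(0, 0.1, n))
    cross = np.exp(rng.normal(0, 0.1, n))          # competitor price index
    volume = np.exp(1.0 - 1.4 * np.log(price)      # true own-elasticity -1.4
                    + 0.5 * np.log(cross) + rng.normal(0, 0.2, n))
    df = pd.DataFrame({"log_vol": np.log(volume), "log_price": np.log(price),
                       "log_cross": np.log(cross),
                       "dow": np.tile(np.arange(7), 53)[:n]})

    m = smf.ols("log_vol ~ log_price + log_cross + C(dow)", data=df).fit()
    e = m.params["log_price"]
    # A 1% price rise changes revenue by roughly (1 + e)%, so e > -1 marks an
    # "inelastic" product whose price increase lifts revenue. For buy-in,
    # back-test on a held-out window and report volume MAPE instead of R2.
    print("own-price elasticity:", round(e, 2))
    ```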
    I created a DALL·E Flow website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    Microsoft Research Introduces i-Code: An Integrative and Composable Multimodal Machine Learning Framework
    Machine learning has long aimed to provide models with intelligence comparable to humans. Humans can automatically blend multiple sensory inputs like visual, linguistic, and acoustic signals to generate a complete knowledge of their surroundings by virtue of their intelligence. Even the most robust pre-trained AI models, in contrast to humans, are incapable of doing so, confining themselves to one or two input modalities. Researchers have always been interested in developing effective multimodal learning strategies to support this viewpoint. In their new paper, to further support this idea, the Microsoft Azure Cognitive Services Research team proposes a self-supervised pretraining framework named i-Code: An Integrative and Composable Multimodal Learning Framework. Continue Reading Paper: https://arxiv.org/pdf/2205.01818.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AGI Survey for my Ethics Class
    I created a four question survey on the topics of medical AI ethics and artificial empathy. Answers are anonymous and responses are appreciated, thank you! Artificial Therapists submitted by /u/VoltGe [link] [comments]
    Need advice, looking for a chatbot with predefined knowledge
    Hi, I am a game developer and I am thinking about making a game with a procedurally generated world. I'd like this world to have NPCs that the player can talk to. Trouble is, I can't write the dialog for all characters if the world is procedural, so the content of the player-NPC conversations is different with each play. There are two ways to solve this. a) The normal way. I can write some dialog and have it be full of variables and variants that change to match what is procedurally generated. There are big drawbacks to this though. There will still be a lot of manual work, which means this solution doesn't scale well. If I realize I need 2 times more content, I will have to do two times more work, probably more, since even the previous content will get more complex with more new variables. b) A general solution that doesn't need the manual wiring for each conversation. I've seen some great stuff with GPT-3 and GPT-NeoX, but that seems like a different league entirely. My problem looks like this: The player engages in conversation with an NPC. The NPC, say a random trader, has some static behaviour, like prompting trade, or quests. But apart from that, the player has a bunch of predefined questions that he can ask the NPC. Such as: Who are you? What's new in town? Where is the local blacksmith? .. What I need to do is collect all the data that this NPC should know about, and have him interpret and understand the player's question. The NPC/AI recognizes that the player is asking for directions, asks the game to calculate response data (like: it's north from here, next to the tavern), and the NPC/AI then synthesizes human-like text containing the data as a response to the player. Is there a machine learning based solution that can help me implement the general solution to this problem? Thanks submitted by /u/Roggi44 [link] [comments]  ( 2 min )
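    Since the questions are predefined, a lightweight retrieval approach may be enough (a sketch using sentence-transformers; the model name and intents are illustrative): embed the predefined questions once, map the player's free text to the nearest intent, then fill an answer template with the procedurally generated facts.

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    intents = {
        "who_are_you": "Who are you?",
        "whats_new": "What's new in town?",
        "directions_blacksmith": "Where is the local blacksmith?",
    }
    keys = list(intents)
    intent_emb = model.encode([intents[k] for k in keys], convert_to_tensor=True)

    def classify(question):
        # Nearest predefined intent by cosine similarity.
        q = model.encode(question, convert_to_tensor=True)
        return keys[int(util.cos_sim(q, intent_emb)[0].argmax())]

    print(classify("whereabouts can I find a smith?"))  # -> directions_blacksmith
    ```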
    AI Researchers From Universidad Rey Juan Carlos, Spain Propose A New Method Of Contact Deformations Machine Learning For Real-Time Dynamic Simulation
    The modeling of touch and deformations has piqued computer graphics' interest since it allows computer-generated models of persons and their surroundings to come to life. Despite significant advances in the domain, scientists still struggle to replicate high-resolution contact at interactive speeds. Many researchers are looking into ways to incorporate machine-learning approaches to model contact-driven deformations, inspired by their success in modeling self-driven deformations, i.e. deformations that occur due to an object's own motion. The methods developed so far learn rich nonlinear deformations as a function of the subspace state by using a subspace representation of the deformable object. However, the ML algorithms modeling contact deformation either simulate only smooth global contact responses or exhibit extremely restricted 3D interactions. According to the researchers, the previously developed models have some limitations. Deformations are modeled in an object-centric way, which is a good choice for self-driven deformations, since they are smooth with respect to the object's subspace state, and machine learning achieves high generalization even from sparse data. Contact-driven deformations, on the other hand, are not smooth with respect to the object's state; therefore, machine learning of these deformations would necessitate intensive sampling of the object's subspace state. This is problematic because the configuration space is huge and difficult to cover. Continue Reading Paper: http://mslab.es/projects/ContactCentricLearning/contents/Romero_SIG2022_final.pdf Github: http://mslab.es/projects/ContactCentricLearning/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Ai background remover tool
    submitted by /u/Alive_Ad_2882 [link] [comments]
  • Open

    NEUROMORPHIC COMPUTING WILL NEED PARTNERS TO BREAK INTO THE DATACENTER
    submitted by /u/nnnaikl [link] [comments]

  • Open

    [P] How to do multivariate time series classification using C# and either the Accord.NET or Encog libraries?
    I have a time series based on financial security prices with additional features. I wish to feed this series into some ML construct in order to perform multi-class classification. Most of the solutions that I found in my search offer predictions. I am not interested in predicting future prices. I merely wish to train the ML construct to offer the most likely class for the given time-series input frame. I am looking for C# solutions or links to tutorials that use either the Accord.NET, Encog or ML.NET libraries. I would be most appreciative for answers that lead my eyes to view C# code that demonstrates a solution to my question. In lieu of the above, I would also appreciate a description of the types of ML constructs that would satisfy my requirements. I have no interest in Python solutions. Please, do not chastise me or praise Python. I need the code to be in C# so that it easily integrates into existing code. Thank you. Edit: I forgot to include ML.NET. submitted by /u/LeftShoeHighway [link] [comments]  ( 1 min )
    [P] I need help finding an AI that tells you what sports career is best for you.
    You ask a set of questions and it will use your answers to tell you what sports career is best for you. I have been having a hard time finding it online. Can anyone lend a hand? It would be greatly appreciated. submitted by /u/Texidork [link] [comments]  ( 1 min )
    [D] AI stocks
    The advances in AI over the last two years have been mind-blowing. I have taken some part-time ML classes just to try to get a grasp, and wow, I'm impressed by what someone like me can do with low coding skills but high willingness to learn. I have already built a recommendation model to improve my policy paragraphs based on input text from relevant research articles. I have to admit that Grammarly and QuillBot beat my hobby project, but it was a fun run and I have gotten a ton of experience. One takeaway is that AI is clearly the future, and I want to place some of my investments in AI as a sector. Buy and hold for the future. The stock market is plummeting and will probably continue to do so for a while, but I want to start to research my options. Can you guys share your knowledge of tradable businesses, either pure AI companies or parent companies with a controlling stake? I'm all for responsible trading, but feel free to share uncertain YOLO companies as well. submitted by /u/sikkerhetellersafety [link] [comments]  ( 1 min )
    [P] I made an open-source demo of OpenAI's CLIP model running completely in the browser - no server involved. Compute embeddings for (and search within) a local directory of images, or search 200k popular images from Reddit (as shown in this video). Link to demo and Github repo in comments.
    submitted by /u/joerocca [link] [comments]  ( 2 min )
    Which Alg to use? [R]
    Hey, so I am taking a few images and want to use them to train a model that predicts how closely a test image matches the training images. Would a CNN be my best bet, or Haar cascade classifiers? Any other thoughts? I am doing this in Python on Google Colab. Thanks! submitted by /u/Cloverdover1 [link] [comments]  ( 1 min )
    [Project] Volunteers Needed for Ukraine Project
    We are recruiting volunteers for a project that will help Ukraine. This is a data-oriented project, and we can use all the help we can get. We want to work very intensely on this project so we can release it quickly. To join us and help Ukraine, please reach out to [breaker25789@gmail.com](mailto:breaker25789@gmail.com) with your name, email, and the team you are interested in.
    Data Team
    · No prior skills necessary. New volunteers will receive training in identifying soldiers and military equipment upon joining our team.
    · This role takes a minimum of five (5) hours a week.
    · Minimum Age: 18+
    · CONTENT WARNING: The primary role of a member of the Data Team is to directly interact with photos and videos from the war in Ukraine, which often contain graphic images of violence and death.
    Machine Learning Team
    · Each volunteer needs to be able to dedicate a minimum of ten (10) hours a week.
    · Preferred prior experience includes familiarity with Docker, AWS SageMaker and S3, machine learning attacks, machine learning security, dedicated red team work, and/or data science.
    · Minimum Age: 18+
    · CONTENT WARNING: Individuals directly involved in training certain algorithms will be exposed to photos and videos from the war in Ukraine, which often contain graphic violence and death. Please notify us if you would prefer to not see that content.
    submitted by /u/OttersAreDevilSpawn [link] [comments]  ( 2 min )
    [D] Research Director at Deepmind says all we need now is scaling
    submitted by /u/SnoozeDoggyDog [link] [comments]  ( 5 min )
    [P] Image Fusion Techniques for Image classification Task
    Can anyone recommend sources on image fusion techniques (preferably for RGB and near-infrared images) for image classification tasks? submitted by /u/Antman-007 [link] [comments]
    [D] Taking derivative of Expectation with respect to Phi (Variational Inference)
    The snippet below from page 20 of the paper here mentions that the derivative cannot be taken inside the expectation, as the expectation is a function of phi. https://preview.redd.it/jo4e2nt0lgz81.png?width=1148&format=png&auto=webp&s=1f3acb968a07aa3858657770c12689031105a03d However, the paper here (page 3) from the same author shows the score function estimator being used to estimate the gradient, which takes the derivative inside the expectation even when the expectation is a function of phi. Highlighted in the snippet below: https://preview.redd.it/fsv8s272lgz81.png?width=940&format=png&auto=webp&s=c5ad403563d1f9c8048f95a413eb188c35e785c3 I am unable to understand the discrepancy. Could anyone please help me gain insight into this? I feel that I am missing something. submitted by /u/That-Mud3051 [link] [comments]  ( 1 min )
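    For what it's worth, the usual way to reconcile the two statements is the log-derivative (score function) trick: the gradient cannot be moved inside the expectation unchanged, because the density q_phi depends on phi, but it can be rewritten as an expectation of f(z) times the score. Assuming f does not itself depend on phi and regularity conditions justify swapping differentiation and integration:
    ```latex
    \nabla_\phi \, \mathbb{E}_{q_\phi(z)}\left[ f(z) \right]
      = \nabla_\phi \int q_\phi(z)\, f(z)\, dz
      = \int f(z)\, \nabla_\phi q_\phi(z)\, dz
      = \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz
      = \mathbb{E}_{q_\phi(z)}\left[ f(z)\, \nabla_\phi \log q_\phi(z) \right]
    ```
    So the two papers are consistent: the first warns against a naive swap, while the second applies this identity, which is exactly what yields the (high-variance) REINFORCE-style estimator.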
    [D] Best resources to keep up with latest machine learning research
    I am an 'applied' machine learning researcher, i.e. about 80% of my time is on machine learning, 20% is applying it to physics problems. The increasing breadth and depth of new machine learning research is awe-inspiring. I would like to be able to keep up with the newest developments in the field, somehow, without obviously having the time to read all the latest developments. Is there a website, a resource (like weekly or monthly magazines), or community aimed at collating the newest insights and directions and publishing summaries/overviews in digestible formats? submitted by /u/intheprocesswerust [link] [comments]  ( 2 min )
    [R] Predicting the Elections with Deep Learning - Part 1 - Results
    Hi everyone 👋, During the last 18 months, I played — maybe too much — with census & elections data and tried to predict the french elections with deep learning. I'm not officially a ML researcher, that's only a side project, so be nice please 🙏 ! I'm sharing this only to have fun discussions with people who have the same kink as me. 😊 Here it is 👉 Predicting the Elections with Deep Learning - Part 1 - Results You don't have to be a data scientist to read this 1st post which talks only about the results of the experiment. TLDR of this 1st post: No I didn't find a way to predict the elections results But It's surprising how much you can learn about voters behavior (who votes for which party), even using only aggregated public data. It's only qualitative results so it's not real hard science. Still, it's interesting (worrying?) to think about what would be possible with more data (e.g: FAANG) After that, I'll write 2 more posts: 1 for the model implementation and 1 for the MLOps tooling I've used. submitted by /u/qchenevier [link] [comments]  ( 3 min )
    [P] Java implementation of GPT2 tokenizer
    GPT2 Tokenizer Java. When developing a service using the GPT3 API, we often need to count the number of tokens. However, if you develop a service in Java, it is not easy to count this. GPT3 is known to use the same tokenizer as GPT2, so this should be a huge help for someone. For more detail and code, refer to https://github.com/hyunwoongko/gpt2-tokenizer-java.
    Requirements. Please install the following dependencies to use the library.
    ```
    implementation 'com.google.api-client:google-api-client:1.32.2'
    implementation 'org.apache.commons:commons-lang3:3.12.0'
    implementation 'org.springframework.boot:spring-boot-starter-web'
    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.3.1'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.3.1'
    ```
    Add tokenizer files to the resources directory. Please add the encoder.json and vocab.bpe files to your project resources directory. These files can be found here.
    Usage. The following are simple examples of this library. To check the test code for this, refer to here.
    Encoding text to tokens:
    ```java
    import ai.tunib.tokenizer.GPT2Tokenizer;
    import java.util.List;

    GPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained("PATH/IN/RESOURCES");
    List<Integer> result = tokenizer.encode("Hello my name is Kevin.");
    // [15496, 616, 1438, 318, 7939, 13]
    ```
    Decoding tokens to text:
    ```java
    import ai.tunib.tokenizer.GPT2Tokenizer;
    import java.util.List;

    GPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained("PATH/IN/RESOURCES");
    String result = tokenizer.decode(List.of(15496, 616, 1438, 318, 7939, 13));
    // "Hello my name is Kevin."
    ```
    License. This project is licensed under the terms of the Apache License 2.0. Copyright 2022 Hyunwoong Ko. All Rights Reserved. submitted by /u/hyunwoongko [link] [comments]  ( 1 min )
    [D] [R] AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for Language Modeling
    https://arxiv.org/abs/2205.05862 It looks promising, what do you think? submitted by /u/carl__11 [link] [comments]
    [D] LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Paper Summary)
    Pre-training of models for NLP applications has focused exclusively on text-level manipulation, while neglecting the layout and style information that is vital for document image understanding. This paper proposes LayoutLM, which jointly models interactions between text and layout information across scanned document images. It fits very well in use-cases like resume parsing, bill parsing, table parsing, etc. Paper Summary: https://youtu.be/ewyDVIdKXm0 Paper Link: https://arxiv.org/abs/1912.13318 submitted by /u/prakhar21 [link] [comments]  ( 1 min )
    [D] Geometric perspective of meta-learning
    I am curious if there is some work explaining the efficiency of meta-learning from a geometric perspective. For example, the famous meta-learning algorithm MAML aims to find parameters of a neural network model that can quickly adapt to new tasks. Intuitively, this is very similar to finding a point in the parameter space that has short distances to all the optimal parameters of the training tasks, i.e. a barycenter under some distance metric, and then the adaptation step may be considered a transport path. This is just a simple (maybe naive) example; it would be great if you could share some related work. Thank you! submitted by /u/mike_shen [link] [comments]  ( 1 min )
    [D] ICML author notifications
    The author notifications for ICML submissions will be released 14 May (Anywhere on Earth time). I thought to start a discussion thread here so we can discuss any issues and share commiserations/celebrations. Note the release date means it could be as late as end of day on 14 May Anywhere on Earth time, effectively meaning 15 or even 16 May for some people, depending on where you live and if there are delays. For reference: the notification of phase 1 rejections were delayed by some hours (around 6 or 12, but not more than 24 iirc); and the author feedback period seemed to open on time, maybe even early. I think the main thing to remember is that no matter the result, your work still has value and you still have value. You are not your work, and any evaluation on the value of your work is just a few people's opinions in a noisy, messy system. You can check Anywhere on Earth time here: https://time.is/Anywhere_on_Earth submitted by /u/tfburns [link] [comments]  ( 3 min )
    [P] Identifying if there are any clusters in a data warehouse
    Hi, I have a lot of structured data whose points should be (and are) random, i.e. we don't want clusters of points in space; the points are and should be sparse. The data comes from a dataset that had leak issues. The problem statement is: find whether there's any subset of the data with the same root cause (e.g. the same brand or the same country caused the issue). There's a mixture of numerical and categorical data (e.g. size, lat, long; country, brand, priority, city). Assume there are around 20+ dimensions (columns); my first approach was to find whether those points form any cluster. To do so, I was going to use a density-based clustering algorithm, since you don't have to specify the number of clusters and it will just ignore noise (which should be most of the points). But it is hard to preserve the meaning of the columns with dimensionality reduction, and obviously we can't cluster in 20 dimensions; it would take forever. We don't know what could cause the leaks, and we don't know which column is important or not. What's an alternative approach or a better solution? Thanks (Note: my post got taken down, so reposting) submitted by /u/micdean19 [link] [comments]  ( 2 min )
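    One possible route, sketched under the assumption that the third-party gower package (pip install gower) is available: Gower distance handles mixed numeric/categorical columns directly, so no dimensionality reduction is needed, and DBSCAN on the precomputed distances labels sparse points as noise. Synthetic data stands in for the real table.
    ```python
    # Density-based clustering over mixed-type data via a Gower distance matrix.
    import gower  # third-party: pip install gower (an assumption, not a requirement)
    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "size": rng.normal(100, 15, 500),
        "lat": rng.uniform(-90, 90, 500),
        "long": rng.uniform(-180, 180, 500),
        "country": rng.choice(["TH", "US", "DE"], 500),
        "brand": rng.choice(["A", "B", "C", "D"], 500),
    })

    distance_matrix = gower.gower_matrix(df)  # pairwise distances in [0, 1], mixed types OK
    labels = DBSCAN(eps=0.05, min_samples=10, metric="precomputed").fit_predict(distance_matrix)
    print(pd.Series(labels).value_counts())  # -1 = noise; any other label = a dense subset
    ```
    The pairwise matrix is O(n^2) in memory, so this only works for modest sample sizes; for larger tables, HDBSCAN on a scaled one-hot encoding, or subsampling, are common workarounds.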
    How Useful Is Small Data Experimentation? [Discussion]
    Hey all, I'm a front end engineer with an interest in building user interfaces for machine learning tools. I want to build a library for creating web user interfaces that leverage numpy, scipy, sklearn, and any other data science library supported by Pyodide (a tool for running Python in the web browser). The advantages are that there's no environment setup, it can run anywhere, and complex user interfaces can be built easily using existing web frameworks (like React, Vue, etc). The limitation is that, since it's running in the browser, runtimes will be longer (1-12x longer according to the documentation), and working with bigger data sets may not be feasible. Also, we'd have access to only a few of the data science libraries available in Python (mentioned above). So, really,…  ( 2 min )
  • Open

    Gato & AGI doubts
    Just read https://thenextweb.com/news/deepminds-astounding-new-gato-ai-makes-fear-humans-will-never-achieve-agi? I didn't read the whole article but based off the title I'm guessing the author of the article doubts AGI will happen because of what he sees with Gato? Why would he think that? Maybe I'm missing something submitted by /u/Ashamed-Asparagus-93 [link] [comments]  ( 1 min )
    Artificial Intelligence Books: These 10 Sci-Fi Novels You Must Read
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    New ML tool to help data scientists manage cloud workloads with Terraform
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    GALLERY
    submitted by /u/cookingandcraft [link] [comments]
    Artificial Intelligence Implications: The Future of Formula One | This is my second post in a university blog series surrounding AI, The Future, and a personal area of interest! Would love some feedback and advice for future episodes!
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
    Seeking the perfect song recommendation method
    I hope you are doing as well as you can. I have been investigating and researching solutions to a problem for a few months now, and I need your valuable help today. Problem: generate a playlist based on multiple users' music tastes. The goal is for the playlist to suit everyone, with the highest level of satisfaction for every user. Research path: Spotify and recommendation engines. I arrived at the following conclusion: the best way to achieve this is to build a music recommendation engine, but of course, I don't want to create it by myself, because it would need a huge amount of data that I can't provide or find, so I need to work with a third-party service that already has this data and a recommendation engine, and it seems like there are not many companies that offer that service and that'…  ( 3 min )
    Hello, could I have a minute of your time to answer some questions about AI and its uses in cybercrime? I hope you will help me with this one.
    submitted by /u/SpoiledCheweez [link] [comments]  ( 1 min )
    ai is making ai jokes about ai with ai voice...
    submitted by /u/Alive_Ad_2882 [link] [comments]
    DeepMind’s new AI "Gato" can do 604 tasks e.g play Atari, control Robots, chat, caption Images, etc
    submitted by /u/qptbook [link] [comments]
    AI vs AI - Two AI Bots Chatting
    submitted by /u/Alive_Ad_2882 [link] [comments]  ( 1 min )
    Microsoft AI Team Proposes Floquet Codes: A New Class Of Quantum Error Correction Codes That Are Well Suited To Topological Qubits
    The Microsoft Azure Quantum program is built on technological advancements that enable quantum computing to scale. Microsoft announced the notion of topological qubits in March 2022: qubits that are theoretically more stable than existing ones without sacrificing size or speed. However, developing a general-purpose quantum computer capable of solving industrial-scale issues will necessitate innovation at all levels of the quantum stack, from nanoscale materials to algorithms and applications. Quantum states are inherently fragile and are quickly destroyed when a qubit is coupled to its surroundings; the resulting noise makes quantum computers challenging to build. Error correction, which is also utilized in traditional digital computing, is a critical technology for overcoming this fragility. Quantum error correction (QEC) can detect and fix most faults that occur on physical qubits by encoding the state of a single logical qubit into multiple physical qubits. However, depending on the quality of the physical qubits, error correction might raise a computation's space requirements by a factor of thousands, and its time requirements by more than tenfold. As a result, any improvement in error correction has a huge positive impact throughout the stack. Continue Reading Paper: https://arxiv.org/pdf/2202.11829.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Sampling with replacement until you’ve seen everything
    Suppose you have a standard deck of 52 cards. You pull out a card, put it back in the deck, shuffle, and pull out another card. How long would you expect to do this until you’ve seen every card? Here’s a variation on the same problem. Suppose you’re a park ranger keeping data on tagged […] Sampling with replacement until you’ve seen everything first appeared on John D. Cook.  ( 3 min )
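    The classic answer here is the coupon collector expectation: for n distinct cards, the expected number of with-replacement draws to see them all is n times the n-th harmonic number. A quick check in code:
    ```python
    # Expected number of with-replacement draws to see all n cards: n * H_n.
    from fractions import Fraction

    n = 52
    expected = n * sum(Fraction(1, k) for k in range(1, n + 1))
    print(float(expected))  # ~235.98 draws for a full 52-card deck
    ```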
  • Open

    Profiling Python Code
    Profiling is a technique to figure out how time is spent in a program. With these statistics, we can find the “hot spot” of a program and think about ways of improvement. Sometimes, a hot spot in an unexpected location may hint at a bug in the program as well. In this tutorial, we will […] The post Profiling Python Code appeared first on Machine Learning Mastery.  ( 18 min )
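    As a concrete starting point, Python's built-in cProfile plus pstats covers the basic workflow the post describes; a minimal, self-contained example (the toy function is just a stand-in for real code):
    ```python
    # Profile a toy function and print the ten hottest entries by cumulative time.
    import cProfile
    import pstats


    def slow_sum(n: int) -> int:
        total = 0
        for i in range(n):
            total += i * i
        return total


    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(1_000_000)
    profiler.disable()

    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
    ```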

  • Open

    Last Week in AI: FDA Clearances, Firing at Google AI, AI for Apple Watch, AI Reviews Beer and Wine
    submitted by /u/regalalgorithm [link] [comments]
    The most effective method to befuddle AI models
    submitted by /u/p0goniphaft111 [link] [comments]
    amazing
    submitted by /u/Murat1_31 [link] [comments]
    AI News | New Intel Gaudi2 & Greco AI Deep Learning Processors | AI Predicts Crohn Disease | AI & Ocean Waves
    submitted by /u/SlightSituation [link] [comments]
    Rendering 3D objects using differentiable SDFs
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
    Gato: A single Transformer to RuLe them all! (Deepmind's new model)
    submitted by /u/OnlyProggingForFun [link] [comments]
    China Says It’s 3D Printing a 590-Foot Hydroelectric Dam With Zero Human Labor
    submitted by /u/estasfuera [link] [comments]
    Intel’s Habana Labs Introduces Second-Generation AI Deep Learning Processors
    Today, Intel's Habana Labs team announced two critical new products: Gaudi2, the second iteration of the Gaudi DL training processor, and Greco, the successor to the Goya DL inference processor. The new processors are significantly faster than their predecessors and competitors, and they are the first new chips launched by Habana Labs following its acquisition by Intel. Gaudi2 and Greco switched from a 16nm to a 7nm process (via TSMC, the manufacturer). The 10 tensor processing cores present in the Gaudi training processor have been increased to 24, and the in-package memory capacity has tripled from 32GB (HBM2) to 96GB (HBM2E). The onboard SRAM has been increased from 24MB to 48MB. Gaudi2 is the first and only accelerator with such an amount of memory. The processor has a TDP of 600W (vs. Gaudi's 350W), although it is claimed that it still employs passive cooling and does not require liquid cooling. Continue Reading submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Meta AI Introduces ‘Make-A-Scene’: A Deep Generative Technique Based On An Autoregressive Transformer For Text-To-Image Synthesis With Human Priors
    In recent years, research related to text-to-image generation has been growing exponentially. Nevertheless, current methods still lack at least three essential characteristics. First of all, most models accept as input solely text information. This is a massive limitation, as the controllability of the model is limited to style or color and cannot be extended to structure or form, for example. The second limitation is related to human perception: the final aim of these models is to match human perception and attention, but in reality the generation process does not include any relevant prior knowledge of this. For example, the losses which control the generation are usually applied to the whole image, without a specific focus on parts fundamental to human perception (such as human faces, animals, or salient objects). The last missing characteristic is the ever-present problem of quality and resolution, as most of the works are limited to an output resolution of 256×256. For these reasons, the Facebook AI team has introduced Make-A-Scene. This novel method successfully tackles these three gaps while attaining SOTA results for text-to-image generation. The proposed model consists essentially of three encoders with discrete tokens, an auto-regressive transformer that learns to generate sequences of tokens conditioned on the scene segmentation, and a decoder that generates images from this transformer-generated sequence. It is important to note that the network does not use the segmented scene for computing the loss; thus, the segmentation is not necessary at inference time. Continue Reading This Summary Paper: https://arxiv.org/pdf/2203.13131v1.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    BlobGAN: A GAN model that uses simple blobs to manipulate objects in images
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    Enhance the caller experience with hints in Amazon Lex
    We understand speech input better if we have some background on the topic of conversation. Consider a customer service agent at an auto parts wholesaler helping with orders. If the agent knows that the customer is looking for tires, they’re more likely to recognize responses (for example, “Michelin”) on the phone. Agents often pick up […]  ( 5 min )
    Run automatic model tuning with Amazon SageMaker JumpStart
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). In March 2022, we also announced the support for APIs in JumpStart. JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across […]  ( 7 min )
  • Open

    [R] Scaling tasks during pre-training may be > Scaling Model Parameters
    https://arxiv.org/pdf/2201.06910.pdf ...This leads to a crucial discovery that task scaling can be an efficient alternative to model scaling; i.e., the model size has little impact on performance with an extremely large number of tasks. Our results show that task scaling can substantially improve training efficiency by 30 times in FLOPs.. tl;dr scale the amount of tasks as well as data, compute, hyperparameters, FLOPs and prayers for effectively training LLMs. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [Discussion] We tried learning AI from games. How about learning from players?
    I wrote a post arguing that video games are more relevant than ever for AI research. Essentially, RL is at an impasse, and all the really impressive progress comes from self-supervised learning. Could we learn behavior foundation models from millions of traces of real humans playing real games, and would this get us more general and more real intelligence? https://modl.ai/learning-ai-from-players/ submitted by /u/togelius [link] [comments]  ( 2 min )
    [P] I was tired of screenshotting plots in Jupyter to share my results. Wanted something better, information rich. So I built a new %%share magic that freezes a cell, captures its code, output & data and returns a URL for sharing.
    Video: https://reddit.com/link/uosqgm/video/pxk7h4jb49z81/player
    You can try it out in Colab here: https://colab.research.google.com/drive/1E5oU6TjH6OocmvEfU-foJfvCTbTfQrqd?usp=sharing#scrollTo=cVxS_6rBmLKW
    To install:
    ```
    pip install thousandwords
    ```
    Then in Jupyter Notebook:
    ```python
    from thousandwords import share
    ```
    Then:
    ```
    %%share
    # Your Python code goes here..
    ```
    More details: https://docs.1000words-hq.com/docs/python-sdk/share Source: https://github.com/edouard-g/thousandwords Homepage: https://1000words-hq.com
    EDIT: Thanks for the upvotes and the feedback. People have voiced their concerns of inadvertent data leaks, and that the Python package wasn't doing enough to warn the user ahead of time. As a short-term mitigation, I've pushed an update. The %%share magic now warns the user about exactly what gets shared and requires manual confirmation (details below). We'll be looking into building an option to share privately. Feel free to ping me for questions/concerns. More details on the mitigation:
    ```python
    from thousandwords import share
    x = 1
    ```
    Then:
    ```
    In [3]: %%share
       ...: print(x)
    This will upload 'x' server-side. Anyone with the link will have read access. Do you wish to proceed ? [y/N]
    ```
    submitted by /u/Left_Ad8361 [link] [comments]  ( 5 min )
    [R] Clinical Prompt Learning with Frozen Language Models
    Hi everyone! We are a team of computational and clinical scientists on the Chronosig Project, Oxford. We are excited to share our work on prompt learning for clinical support tasks. A major concern in the field of NLP is that even the largest pre-trained language models, such as GPT-3, do not perform well in specialized domains such as clinical texts. Typically, adapting large PLMs to new domains requires fine-tuning entire models on new texts and downstream tasks, which can require entire fleets of high-end GPUs. Prompt learning is a new paradigm of NLP that fully utilizes the pretrained language model based on its pretraining task, and it can require substantially fewer training parameters for a new task. In this work, we investigated the feasibility of prompt learning with small or medium domain-trained frozen language models versus traditional fine-tuning on clinical tasks, using MIMIC-III data in full-data and few-shot settings. We also develop a new clinical triage task based on ICD-9 codes, available for future use. We demonstrated that prompt learning requires fewer tuned parameters to match, and even outperform, traditional fine-tuning in few-shot settings. We actually observed prompt learning matching the performance of traditional fine-tuning with 1000 times fewer trained parameters. This is particularly encouraging for high-stakes domains, such as medicine, where high-quality annotated data is scarce. We hope prompt tuning can offer a competitive alternative to fine-tuning of large language models in the medical domain and act as a framework moving forward. Arxiv: https://arxiv.org/abs/2205.05535 GitHub: https://github.com/NtaylorOX/Public_Clinical_Prompt EDIT: For anyone who may need a TL;DR: we are creating a blog post, and we hope it will be released by next week! Figure: An illustration of the workflow of CPL submitted by /u/YiTsubasa [link] [comments]  ( 2 min )
    [D] ZIP models as a means to handle regression on data with excess of zeros
    Hi everyone, sharing an article on how to handle regression for data which has lots and lots of zeros: VevestaX/ZIP_tutorial.md at main · Vevesta/VevestaX · GitHub. This article is based on my personal experience. While using regression on a peculiar Kaggle problem, I was confounded by excessive zeros in the dependent variable. The reason why models like ordinary regression and Poisson regression failed wasn't immediately clear to me. On digging, I found that lesser-known models - hurdle and ZIP (zero-inflated Poisson) models - are specially suited to this problem. Happy to hear feedback on the article. Please let me know if you know other techniques for handling this problem of excessive zeros in the dependent variable. submitted by /u/vevesta [link] [comments]  ( 1 min )
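    For anyone who wants to try this directly, statsmodels ships a zero-inflated Poisson implementation; a hedged sketch on synthetic data (all names and coefficients are illustrative, not from the article):
    ```python
    # Fit a zero-inflated Poisson model to a count target with excess zeros.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.discrete.count_model import ZeroInflatedPoisson

    rng = np.random.default_rng(0)
    n = 1000
    X = sm.add_constant(rng.normal(size=(n, 2)))
    # Simulate: a logistic gate produces structural zeros; Poisson counts otherwise.
    is_structural_zero = rng.random(n) < 0.4
    y = np.where(is_structural_zero, 0, rng.poisson(lam=np.exp(0.5 + 0.3 * X[:, 1])))

    model = ZeroInflatedPoisson(endog=y, exog=X, exog_infl=X, inflation="logit")
    result = model.fit(method="bfgs", maxiter=500, disp=False)
    print(result.summary())
    ```
    The inflation part models which zeros are structural, while the count part models the rest, which is exactly the separation that plain Poisson regression lacks.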
    [D] System Requirement to host a finetuned GPT Neo X 20B model
    GPT 3 Davinci with 175B parameters costs $0.0600 / 1K tokens. Their cheaper model, Curie, with 13B parameters costs $0.0060 / 1K tokens. GPT Neo X has 20B parameters, yet on so many sites GPT Neo X costs more than GPT 3 Davinci with 175B parameters. For example, on NLPCloud GPT Neo X costs $0.095 / 1K tokens, and on Goose AI it costs $0.063 / 1K tokens. The quality difference between Davinci and Neo X is huge, so why is the price higher than Davinci's? So I was thinking about hosting it on my own custom server. I know fine-tuning it is gonna require a lot of resources, but it's only a one-time thing, so I don't really mind that much. But can you please give me any estimated information about how much it is going to cost, or the system requirements to host an already finetuned Neo X model on a server? Btw, my usage is to process around 200,000 tokens per day (~1000-2000 requests). submitted by /u/Homeless_Programmer [link] [comments]  ( 1 min )
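    On the sizing question, a rough back-of-envelope helps frame it (assumptions in the comments; real numbers depend on the serving stack, sequence lengths, and batch size):
    ```python
    # Back-of-envelope memory for a 20B-parameter model (rough assumptions only).
    params = 20e9

    fp16_weights_gb = params * 2 / 1e9   # ~40 GB just to hold fp16 weights
    int8_weights_gb = params * 1 / 1e9   # ~20 GB with 8-bit quantization

    # Full fine-tuning with Adam in mixed precision is far heavier:
    # fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
    # + fp32 momentum (4) + fp32 variance (4) = 16 bytes per parameter.
    adam_finetune_gb = params * 16 / 1e9  # ~320 GB before activations

    print(fp16_weights_gb, int8_weights_gb, adam_finetune_gb)
    ```
    So inference alone already needs roughly 40+ GB of accelerator memory (one 48GB/80GB card, or two 24GB cards with tensor parallelism), while full fine-tuning without sharding or parameter-efficient methods is a multi-GPU job, which goes some way toward explaining the hosted-API pricing.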
    [D] word misleading classification model
    In my text classification task, a particular word is misleading the model, but this word occurs very frequently in the training data for a particular label. E.g., my training data contains "lost my phone", "changed my phone", ... and all these examples belong to the "problem with telephone" class. I am using the Universal Sentence Encoder to build the model. During inference, if I give some random sentence and put the word "phone" in the middle, my model still predicts the "problem with telephone" class with high confidence. How should I handle this situation? submitted by /u/NSVR57 [link] [comments]  ( 1 min )
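    One common mitigation (a sketch, not the only fix) is to augment the training data by randomly dropping or masking the dominant token, so the classifier cannot rely on it alone; collecting negative examples that contain "phone" but belong to other classes helps for the same reason.
    ```python
    # Token-dropout augmentation: randomly remove the over-represented token
    # from training sentences so it stops acting as a shortcut feature.
    import random

    DOMINANT_TOKENS = {"phone"}

    def augment(sentence: str, drop_prob: float = 0.5) -> str:
        tokens = sentence.split()
        kept = [
            t for t in tokens
            if t.lower() not in DOMINANT_TOKENS or random.random() > drop_prob
        ]
        return " ".join(kept)

    print(augment("lost my phone yesterday"))  # sometimes "lost my yesterday"
    ```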
    [D] Anomaly detection - deviation from normal
    Hi everyone, I am collecting temperature data from a sensor measuring the temperature of a process. The temperature rises to a steady state during operation, and it cools down when the process is stopped. I would like to set a simple threshold notifying me when the temperature during operation is higher or lower than expected. How should I tackle the following:
    1. Filtering out temperatures while the process is warming up (non-steady state)?
    2. How much data should I use to define the threshold? More like a moving window (median, MAD, ...) or mean reversion on all historical data?
    3. I am measuring the room temperature with another sensor; in your opinion, how do I remove seasonality using the second sensor's temperature data (subtraction, division, ...)?
    4. How do I update the thresholds as new data is collected?
    Thank you all. submitted by /u/No-Way3852 [link] [comments]  ( 1 min )
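    A hedged sketch of one way to approach points 1, 2 and 4 with a rolling median/MAD band (synthetic data, column names, and thresholds are placeholders to be tuned):
    ```python
    # Rolling robust threshold over steady-state samples only.
    import numpy as np
    import pandas as pd

    # Synthetic stand-in for the sensor stream: one week of per-minute readings.
    idx = pd.date_range("2022-01-01", periods=7 * 24 * 60, freq="min")
    rng = np.random.default_rng(0)
    temps = pd.Series(70 + rng.normal(0, 0.5, len(idx)), index=idx)

    # (1) Crude steady-state filter: keep samples where short-term drift is ~0.
    drift = temps.diff().rolling("30min").mean()
    steady = temps[drift.abs() < 0.05]

    # (2) + (4) Rolling robust band, recomputed as new data arrives.
    median = steady.rolling("24h").median()
    mad = (steady - median).abs().rolling("24h").median()
    upper = median + 3 * 1.4826 * mad  # 1.4826 * MAD ~ one robust sigma
    lower = median - 3 * 1.4826 * mad

    alerts = steady[(steady > upper) | (steady < lower)]
    print(len(alerts))
    ```
    For point 3, a common choice is to regress the process temperature on the ambient sensor (or simply subtract it) and threshold the residual instead of the raw reading.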
  • Open

    Startups using Reinforcement Learning
    What are the upcoming startups that are based on RL? I know covariant.ai and kindred.ai use RL; are there any other RL-based startups? submitted by /u/blitzkreig3 [link] [comments]
    Flattening the layers that preprocess the observation space
    In this paper, the observation space is a dictionary that has four keys:
    - Image
    - Direction of the agent
    - Position of the agent
    - Recurrent state of the policy
    From line 237 to line 248 (https://github.com/google-research/google-research/blob/98b89bd2d67c7944a8ac381e0417d4c20a6c87ee/social_rl/multiagent_tfagents/joint_attention/attention_networks.py#L237), a few preprocessing layers to process those four tensors of the obs space are built. What I don't understand is what happens after that:
    ```python
    preprocessing_nest = tf.nest.map_structure(lambda l: None, preprocessing_layers)
    flat_observation_spec = nest_utils.flatten_up_to(
        preprocessing_nest,
        observation_spec,
    )
    ```
    If I understand correctly, a lambda returning None is applied to each of those preprocessing layers, and the observation spec is then flattened up to that structure, retaining some of it. I don't understand what this is doing exactly. Can someone explain? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
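    A toy illustration (not the repo's code) of what the first call produces, and why it is useful: map_structure with a constant-None lambda builds a template with the same shallow structure as the layers dict, and flatten_up_to then flattens the observation spec only down to that template's depth, so each flat entry lines up with exactly one preprocessing layer even if the spec nests deeper than the layers do.
    ```python
    import tensorflow as tf

    # Stand-ins for the real Keras layers and the (possibly deeper) spec.
    preprocessing_layers = {"image": "cnn-layer", "direction": "mlp-layer"}
    observation_spec = {"image": {"shape": [64, 64, 3]}, "direction": {"shape": [4]}}

    # Same keys as the layers, every leaf replaced by None:
    preprocessing_nest = tf.nest.map_structure(lambda l: None, preprocessing_layers)
    print(preprocessing_nest)  # {'image': None, 'direction': None}

    # flatten_up_to(preprocessing_nest, observation_spec) would then return a
    # flat list with one entry per key above (here: the two inner sub-dicts),
    # i.e. the None template only records *how far* to flatten the spec.
    ```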
    Which learning algorithms count as "Deep RL"?
    Deep Q-Networks are of course Deep RL. Is PPO another example? How about A2C/A3C/SAC? I don't know enough about these algorithms yet to decide for myself and I'm struggling to find adequate information online. Thanks in advance! submitted by /u/C_BearHill [link] [comments]  ( 1 min )
    In how many steps should we update our networks in DDPG?
    Correct me if I am wrong, but in the original DDPG paper I saw that the update for the actor and critic networks was inside the step loop (of an episode). But this will be very expensive computationally. So, every how many steps or global_steps should I update my models? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    Q-Learning Example Tutorial (w/ Q-table & Bellman equation)
    submitted by /u/lukenewmann1 [link] [comments]
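    For readers skimming past the link, the heart of any such tutorial is the tabular Bellman update; a generic illustration (not the linked tutorial's code):
    ```python
    # One Q-learning step on a toy Q-table:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    import numpy as np

    n_states, n_actions = 6, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.99

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

    q_update(s=0, a=2, r=1.0, s_next=3, done=False)
    print(Q[0])
    ```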
    Cannot find where’s the in-place operation error.
    I'm trying to train a MADDPG model; however, I've run into an in-place operation error. Here's the traceback; I've excerpted the parts that I think are critical:
    ```
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 2]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
    ```
    ```
    [W ..\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward0. Traceback of forward call that caused the error: ... c_loss, a_loss = all_agents.learn(memory) File "C:\Users\chhu…
    ```
     ( 2 min )
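    Without the full code it's hard to point at the exact line, but torch.autograd.set_detect_anomaly(True) (which, given the warning above, already seems to be enabled) points at the forward op that saved the offending tensor. A minimal repro of this error class and the usual fix, for orientation (the poster's actual culprit may differ):
    ```python
    import torch

    x = torch.randn(16, 2, requires_grad=True)
    y = torch.tanh(x)   # tanh saves its output for the backward pass
    y += 1.0            # in-place op bumps y's version counter under autograd
    try:
        y.sum().backward()
    except RuntimeError as e:
        print(e)        # "... modified by an inplace operation ..."

    # Fix: use an out-of-place op (or operate on a clone) so the saved
    # tensor stays intact for backward.
    x2 = torch.randn(16, 2, requires_grad=True)
    y2 = torch.tanh(x2)
    y2 = y2 + 1.0       # new tensor; the original graph node is untouched
    y2.sum().backward()
    ```
    In MADDPG code specifically, a frequent cause is overwriting action or hidden-state tensors between one agent's critic update and another's actor update; cloning those tensors before mutation is the usual remedy.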
    Gato: A single Transformer to RuLe them all! (Deepmind's new model)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Need some help for recommendation system
    Hello. As part of a short internship (around a month long), I am required to create a recommendation system for an e-commerce site. I was planning to build some intuition about how state transitions will take place and what the state will look like. Can you help me decide on the environment, state, reward, and policy for an e-commerce setting? Dataset - https://tianchi.aliyun.com/competition/entrance/231721/information?lang=en-us submitted by /u/kachua26 [link] [comments]  ( 2 min )
    "Searching for Efficient Neural Architectures for On-Device ML on Edge TPUs", Akin et al 2022 {G}
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    New AI Deep Learning Processors | Intel's Gaudi2 & Greco
    submitted by /u/getrich_or_diemining [link] [comments]
  • Open

    Broom, Broom: WeRide Revs Up Self-Driving Street Sweepers Powered by NVIDIA
    When it comes to safety, efficiency and sustainability, autonomous vehicles are delivering a clean sweep. Autonomous vehicle company and NVIDIA Inception member WeRide this month began a public road pilot of its Robo Street Sweepers. The vehicles, designed to perform round-the-clock cleaning services, are built on the high-performance, energy-efficient compute of NVIDIA. The fleet of Read article > The post Broom, Broom: WeRide Revs Up Self-Driving Street Sweepers Powered by NVIDIA appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    A Beginners Guide to RPA ‘ Bots’
    Enterprises burdened with repetitive, time-consuming and mundane processes are constantly on the lookout for new and innovative…  ( 4 min )
  • Open

    How Does Knowledge Graph Embedding Extrapolate to Unseen Data: A Semantic Evidence View. (arXiv:2109.11800v3 [cs.CL] UPDATED)
    Knowledge Graph Embedding (KGE) aims to learn representations for entities and relations. Most KGE models have gained great success, especially in extrapolation scenarios. Specifically, given an unseen triple (h, r, t), a trained model can still correctly predict t from (h, r, ?), or h from (?, r, t); such extrapolation ability is impressive. However, most existing KGE works focus on the design of delicate triple modeling functions, which mainly tell us how to measure the plausibility of observed triples, but offer limited explanation of why the methods can extrapolate to unseen data and what the important factors are that help KGE extrapolate. Therefore, in this work we attempt to study two problems of KGE extrapolation: 1. How does KGE extrapolate to unseen data? 2. How can we design a KGE model with better extrapolation ability? For problem 1, we first discuss the impact factors for extrapolation and, at the relation, entity and triple levels respectively, propose three Semantic Evidences (SEs), which can be observed from the train set and provide important semantic information for extrapolation. Then we verify the effectiveness of SEs through extensive experiments on several typical KGE methods. For problem 2, to make better use of the three levels of SE, we propose a novel GNN-based KGE model, called Semantic Evidence aware Graph Neural Network (SE-GNN). In SE-GNN, each level of SE is modeled explicitly by the corresponding neighbor pattern and merged sufficiently by multi-layer aggregation, which contributes to obtaining more extrapolative knowledge representations. Finally, through extensive experiments on the FB15k-237 and WN18RR datasets, we show that SE-GNN achieves state-of-the-art performance on the Knowledge Graph Completion task and exhibits better extrapolation ability. Our code is available at https://github.com/renli1024/SE-GNN.  ( 3 min )
    Positive, Negative and Neutral: Modeling Implicit Feedback in Session-based News Recommendation. (arXiv:2205.06058v1 [cs.IR])
    News recommendation for anonymous readers is a useful but challenging task for many news portals, where interactions between readers and articles are limited within a temporary login session. Previous works tend to formulate session-based recommendation as a next item prediction task, while they neglect the implicit feedback from user behaviors, which indicates what users really like or dislike. Hence, we propose a comprehensive framework to model user behaviors through positive feedback (i.e., the articles they spend more time on) and negative feedback (i.e., the articles they choose to skip without clicking in). Moreover, the framework implicitly models the user using their session start time, and the article using its initial publishing time, in what we call "neutral feedback". Empirical evaluation on three real-world news datasets shows the framework's promising performance, with more accurate, diverse and even more unexpected recommendations than other state-of-the-art session-based recommendation approaches.  ( 2 min )
    RLOC: Terrain-Aware Legged Locomotion using Reinforcement Learning and Optimal Control. (arXiv:2012.03094v3 [cs.RO] UPDATED)
    We present a unified model-based and data-driven approach for quadrupedal planning and control to achieve dynamic locomotion over uneven terrain. We utilize on-board proprioceptive and exteroceptive feedback to map sensory information and desired base velocity commands into footstep plans using a reinforcement learning (RL) policy. This RL policy is trained in simulation over a wide range of procedurally generated terrains. When run online, the system tracks the generated footstep plans using a model-based motion controller. We evaluate the robustness of our method over a wide variety of complex terrains. It exhibits behaviors which prioritize stability over aggressive locomotion. Additionally, we introduce two ancillary RL policies for corrective whole-body motion tracking and recovery control. These policies account for changes in physical parameters and external perturbations. We train and evaluate our framework on a complex quadrupedal system, ANYmal version B, and demonstrate transferability to a larger and heavier robot, ANYmal C, without requiring retraining.  ( 2 min )
    A BIC-based Mixture Model Defense against Data Poisoning Attacks on Classifiers. (arXiv:2105.13530v2 [cs.LG] UPDATED)
    Data Poisoning (DP) is an effective attack that causes trained classifiers to misclassify their inputs. DP attacks significantly degrade a classifier's accuracy by covertly injecting attack samples into the training set. Broadly applicable to different classifier structures, without strong assumptions about the attacker, an {\it unsupervised} Bayesian Information Criterion (BIC)-based mixture model defense against "error generic" DP attacks is herein proposed that: 1) addresses the most challenging {\it embedded} DP scenario wherein, if DP is present, the poisoned samples are an {\it a priori} unknown subset of the training set, and with no clean validation set available; 2) applies a mixture model both to well-fit potentially multi-modal class distributions and to capture poisoned samples within a small subset of the mixture components; 3) jointly identifies poisoned components and samples by minimizing the BIC cost defined over the whole training set, with the identified poisoned data removed prior to classifier training. Our experimental results, for various classifier structures and benchmark datasets, demonstrate the effectiveness and universality of our defense under strong DP attacks, as well as its superiority over other works.  ( 2 min )
    Ensemble Clustering via Co-association Matrix Self-enhancement. (arXiv:2205.05937v1 [cs.LG])
    Ensemble clustering integrates a set of base clustering results to generate a stronger one. Existing methods usually rely on a co-association (CA) matrix that measures how many times two samples are grouped into the same cluster according to the base clusterings to achieve ensemble clustering. However, when the constructed CA matrix is of low quality, the performance will degrade. In this paper, we propose a simple yet effective CA matrix self-enhancement framework that can improve the CA matrix to achieve better clustering performance. Specifically, we first extract the high-confidence (HC) information from the base clusterings to form a sparse HC matrix. By propagating the highly-reliable information of the HC matrix to the CA matrix and complementing the HC matrix according to the CA matrix simultaneously, the proposed method generates an enhanced CA matrix for better clustering. Technically, the proposed model is formulated as a symmetric constrained convex optimization problem, which is efficiently solved by an alternating iterative algorithm with convergence and global optimum theoretically guaranteed. Extensive experimental comparisons with twelve state-of-the-art methods on eight benchmark datasets substantiate the effectiveness, flexibility and efficiency of the proposed model in ensemble clustering. The codes and datasets can be downloaded at https://github.com/Siritao/EC-CMS.  ( 2 min )
    Contingency-constrained economic dispatch with safe reinforcement learning. (arXiv:2205.06212v1 [eess.SY])
    Future power systems will rely heavily on micro grids with a high share of decentralised renewable energy sources and energy storage systems. The high complexity and uncertainty in this context might make conventional power dispatch strategies infeasible. Reinforcement-learning based (RL) controllers can address this challenge, however, cannot themselves provide safety guarantees, preventing their deployment in practice. To overcome this limitation, we propose a formally validated RL controller for economic dispatch. We extend conventional constraints by a time-dependent constraint encoding the islanding contingency. The contingency constraint is computed using set-based backwards reachability analysis and actions of the RL agent are verified through a safety layer. Unsafe actions are projected into the safe action space while leveraging constrained zonotope set representations for computational efficiency. The developed approach is demonstrated on a residential use case using real-world measurements.  ( 2 min )
    Maximum sampled conditional likelihood for informative subsampling. (arXiv:2011.05988v3 [math.ST] UPDATED)
    Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited. After a subsample is taken from the full data, most available methods use an inverse probability weighted (IPW) objective function to estimate the model parameters. The IPW estimator does not fully utilize the information in the selected subsample. In this paper, we propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data. We established the asymptotic normality of the MSCLE and prove that its asymptotic variance covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the IPW estimator. We further discuss the asymptotic results with the L-optimal subsampling probabilities and illustrate the estimation procedure with generalized linear models. Numerical experiments are provided to evaluate the practical performance of the proposed method.  ( 2 min )
    Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results. (arXiv:2204.03475v2 [cs.CV] UPDATED)
    ImageNet serves as the primary dataset for evaluating the quality of computer-vision models. The common practice today is training each architecture with a tailor-made scheme, designed and tuned by an expert. In this paper, we present a unified scheme for training any backbone on ImageNet. The scheme, named USI (Unified Scheme for ImageNet), is based on knowledge distillation and modern tricks. It requires no adjustments or hyper-parameters tuning between different models, and is efficient in terms of training times. We test USI on a wide variety of architectures, including CNNs, Transformers, Mobile-oriented and MLP-only. On all models tested, USI outperforms previous state-of-the-art results. Hence, we are able to transform training on ImageNet from an expert-oriented task to an automatic seamless routine. Since USI accepts any backbone and trains it to top results, it also enables to perform methodical comparisons, and identify the most efficient backbones along the speed-accuracy Pareto curve. Implementation is available at: https://github.com/Alibaba-MIIL/Solving_ImageNet  ( 2 min )
    Zero-shot Code-Mixed Offensive Span Identification through Rationale Extraction. (arXiv:2205.06119v1 [cs.CL])
    This paper investigates the effectiveness of sentence-level transformers for zero-shot offensive span identification on a code-mixed Tamil dataset. More specifically, we evaluate rationale extraction methods of Local Interpretable Model Agnostic Explanations (LIME) \cite{DBLP:conf/kdd/Ribeiro0G16} and Integrated Gradients (IG) \cite{DBLP:conf/icml/SundararajanTY17} for adapting transformer based offensive language classification models for zero-shot offensive span identification. To this end, we find that LIME and IG show baseline $F_{1}$ of 26.35\% and 44.83\%, respectively. Besides, we study the effect of data set size and training process on the overall accuracy of span identification. As a result, we find both LIME and IG to show significant improvement with Masked Data Augmentation and Multilabel Training, with $F_{1}$ of 50.23\% and 47.38\% respectively. \textit{Disclaimer : This paper contains examples that may be considered profane, vulgar, or offensive. The examples do not represent the views of the authors or their employers/graduate schools towards any person(s), group(s), practice(s), or entity/entities. Instead they are used to emphasize only the linguistic research challenges.}  ( 2 min )
    How I failed machine learning in medical imaging -- shortcomings and recommendations. (arXiv:2103.10292v2 [eess.IV] UPDATED)
    Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we review several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of the literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally, we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available on \url{https://github.com/GaelVaroquaux/ml_med_imaging_failures}  ( 2 min )
    Framework for inferring empirical causal graphs from binary data to support multidimensional poverty analysis. (arXiv:2205.06131v1 [stat.ME])
    Poverty is one of the fundamental issues that mankind faces. The Multidimensional Poverty Index (MPI) is deployed for measuring poverty in a population beyond monetary terms. However, MPI cannot provide information regarding associations and causal relations among poverty factors. Does education cause income inequality in a specific region? Is lacking education a cause of health issues? Without knowing causal relations, policy makers cannot pinpoint the root causes of poverty issues of a specific population, which might not be the same across different populations. Additionally, MPI requires binary data, which cannot be analyzed by most causal inference frameworks. In this work, we propose an exploratory-data-analysis framework for finding possible causal relations, with confidence intervals, among binary data. The proposed framework provides not only how severe the issue of poverty is, but also the causal relations among poverty factors. Moreover, knowing a confidence interval for the degree of causal direction lets us know how strong a causal relation is. We evaluated the proposed framework against several baseline approaches on simulated datasets, as well as on two real-world datasets as case studies: 1) twin births in the United States: the relation between birth weight and mortality of twins, and 2) Thailand population surveys from 378k households of Chiang Mai and 353k households of Khon Kaen provinces. Our framework performed better than the baselines in most cases. The first case study reveals that almost all mortality cases in twins involved low birth weights, but not all low-birth-weight twins died. The second case study reveals that smoking associates with drinking alcohol in both provinces, and there is a causal relation whereby smoking causes drinking alcohol in Chiang Mai province only. The framework can be applied beyond the poverty context.  ( 3 min )
    Machine Learning Workflow to Explain Black-box Models for Early Alzheimer's Disease Classification Evaluated for Multiple Datasets. (arXiv:2205.05907v1 [cs.LG])
    Purpose: Hard-to-interpret black-box Machine Learning (ML) models are often used for early Alzheimer's Disease (AD) detection. Methods: To interpret eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM) black-box models, a workflow based on Shapley values was developed. All models were trained on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and evaluated for an independent ADNI test set, as well as the external Australian Imaging and Lifestyle flagship study of Ageing (AIBL), and Open Access Series of Imaging Studies (OASIS) datasets. Shapley values were compared to intuitively interpretable Decision Trees (DTs), and Logistic Regression (LR), as well as natural and permutation feature importances. To avoid the reduction of the explanation validity caused by correlated features, forward selection and aspect consolidation were implemented. Results: Some black-box models outperformed DTs and LR. The forward-selected features correspond to brain areas previously associated with AD. Shapley values identified biologically plausible associations with moderate to strong correlations with feature importances. The most important RF features to predict AD conversion were the volume of the amygdalae, and a cognitive test score. Good cognitive test performances and large brain volumes decreased the AD risk. The models trained using cognitive test scores significantly outperformed brain volumetric models ($p<0.05$). Cognitive Normal (CN) vs. AD models were successfully transferred to external datasets. Conclusion: In comparison to previous work, improved performances for ADNI and AIBL were achieved for CN vs. Mild Cognitive Impairment (MCI) classification using brain volumes. The Shapley values and the feature importances showed moderate to strong correlations.  ( 2 min )
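    The Shapley-value step of a workflow like this typically looks like the following generic recipe (not the paper's code; the model and dataset here are stand-ins using the shap package):
    ```python
    # Generic tree-model Shapley attribution with the shap package.
    import shap
    import xgboost
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # placeholder data
    model = xgboost.XGBClassifier(n_estimators=100).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # one attribution per feature per sample
    shap.summary_plot(shap_values, X)       # global view of feature effects
    ```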
    Sample Complexity Bounds for Robustly Learning Decision Lists against Evasion Attacks. (arXiv:2205.06127v1 [cs.LG])
    A fundamental problem in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks. In this paper we address this issue within the framework of PAC learning, focusing on the class of decision lists. Given that distributional assumptions are essential in the adversarial setting, we work with probability distributions on the input data that satisfy a Lipschitz condition: nearby points have similar probability. Our key results illustrate that the adversary's budget (that is, the number of bits it can perturb on each input) is a fundamental quantity in determining the sample complexity of robust learning. Our first main result is a sample-complexity lower bound: the class of monotone conjunctions (essentially the simplest non-trivial hypothesis class on the Boolean hypercube) and any superclass has sample complexity at least exponential in the adversary's budget. Our second main result is a corresponding upper bound: for every fixed $k$ the class of $k$-decision lists has polynomial sample complexity against a $\log(n)$-bounded adversary. This sheds further light on the question of whether an efficient PAC learning algorithm can always be used as an efficient $\log(n)$-robust learning algorithm under the uniform distribution.  ( 2 min )
    Over-the-Air Federated Learning with Joint Adaptive Computation and Power Control. (arXiv:2205.05867v1 [cs.LG])
    This paper considers over-the-air federated learning (OTA-FL). OTA-FL exploits the superposition property of the wireless medium and performs model aggregation over the air for free. Thus, it can greatly reduce the communication cost incurred in communicating model updates from the edge devices. In order to fully utilize this advantage while providing comparable learning performance to conventional federated learning that presumes model aggregation via noiseless channels, we consider the joint design of transmission scaling and the number of local iterations at each round, given the power constraint at each edge device. We first characterize the training error due to channel noise in OTA-FL by establishing a fundamental lower bound for general functions with Lipschitz-continuous gradients. Then, by introducing an adaptive transceiver power scaling scheme, we propose an over-the-air federated learning algorithm with joint adaptive computation and power control (ACPC-OTA-FL). We provide the convergence analysis for ACPC-OTA-FL in training with non-convex objective functions and heterogeneous data. We show that the convergence rate of ACPC-OTA-FL matches that of FL with noise-free communications.  ( 2 min )
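    The core over-the-air aggregation idea can be sketched in a few lines; the power-scaling rule below is a deliberate simplification for illustration, not the paper's ACPC-OTA-FL scheme:
```python
# Toy simulation of over-the-air model aggregation: clients transmit
# scaled updates simultaneously, the channel sums them, and the server
# rescales the noisy superposition.
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim = 10, 100
updates = rng.normal(size=(num_clients, dim))   # local model updates

p_max = 1.0                                     # per-device power budget
# Transmission scaling: a common scale chosen so every device satisfies
# its power constraint (a simplifying assumption for this sketch).
alpha = np.sqrt(p_max) / np.abs(updates).max()

noise = rng.normal(scale=0.1, size=dim)         # receiver channel noise
received = alpha * updates.sum(axis=0) + noise  # free summation over the air
aggregate = received / (alpha * num_clients)    # server-side descaling

print(np.linalg.norm(aggregate - updates.mean(axis=0)))  # residual error
```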
    Ethereum Fraud Detection with Heterogeneous Graph Neural Networks. (arXiv:2203.12363v2 [cs.LG] UPDATED)
    While transactions with cryptocurrencies such as Ethereum are becoming more prevalent, fraud and other criminal transactions are not uncommon. Graph analysis algorithms and machine learning techniques can detect suspicious transactions that lead to phishing in large transaction networks. Many graph neural network (GNN) models have been proposed to apply deep learning techniques to graph structures. Although there is research on phishing detection using GNN models in the Ethereum transaction network, models that address the scale of the number of vertices and edges and the imbalance of labels have not yet been studied. In this paper, we exhaustively compared the performance of GNN models on an actual Ethereum transaction network dataset with reported phishing labels to verify which GNN models and hyperparameters produce the best accuracy. Specifically, we evaluated the performance of representative homogeneous GNN models, which consider single-type nodes and edges, and heterogeneous GNN models, which support different types of nodes and edges. We showed that heterogeneous models had better performance than homogeneous models. In particular, the RGCN model achieved the best performance in the overall metrics.  ( 2 min )
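    A minimal sketch of the kind of heterogeneous (relational) model that performed best, using PyTorch Geometric's RGCNConv on a toy multi-relation graph standing in for the Ethereum transaction network:
```python
# Two-layer relational GCN sketch; edge types model different kinds of
# transactions, and per-node logits flag phishing accounts.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

class RGCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes, num_relations):
        super().__init__()
        self.conv1 = RGCNConv(in_dim, hid_dim, num_relations)
        self.conv2 = RGCNConv(hid_dim, num_classes, num_relations)

    def forward(self, x, edge_index, edge_type):
        h = F.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(h, edge_index, edge_type)

x = torch.randn(6, 8)                                # 6 nodes, 8 features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
edge_type = torch.tensor([0, 1, 0, 1])               # two relation types
model = RGCN(8, 16, 2, num_relations=2)
logits = model(x, edge_index, edge_type)             # per-node phishing logits
```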
    Anomaly Detection of Adversarial Examples using Class-conditional Generative Adversarial Networks. (arXiv:2105.10101v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have been shown vulnerable to Test-Time Evasion attacks (TTEs, or adversarial examples), which, by making small changes to the input, alter the DNN's decision. We propose an unsupervised attack detector on DNN classifiers based on class-conditional Generative Adversarial Networks (GANs). We model the distribution of clean data conditioned on the predicted class label by an Auxiliary Classifier GAN (AC-GAN). Given a test sample and its predicted class, three detection statistics are calculated based on the AC-GAN Generator and Discriminator. Experiments on image classification datasets under various TTE attacks show that our method outperforms previous detection methods. We also investigate the effectiveness of anomaly detection using different DNN layers (input features or internal-layer features) and demonstrate, as one might expect, that anomalies are harder to detect using features closer to the DNN's output layer.  ( 2 min )
    Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling. (arXiv:2205.06024v1 [stat.ML])
    The Plackett-Luce (PL) model is ubiquitous in learning-to-rank (LTR) because it provides a useful and intuitive probabilistic model for sampling ranked lists. Counterfactual offline evaluation and optimization of ranking metrics are pivotal for using LTR methods in production. When adopting the PL model as a ranking policy, both tasks require the computation of expectations with respect to the model. These are usually approximated via Monte-Carlo (MC) sampling, since the combinatorial scaling in the number of items to be ranked makes their analytical computation intractable. Despite recent advances in improving the computational efficiency of the sampling process via the Gumbel top-k trick, the MC estimates can suffer from high variance. We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model by combining the Gumbel top-k trick with quasi-Monte Carlo (QMC) sampling, a well-established technique for variance reduction. We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.  ( 2 min )
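    The Gumbel top-k trick and its QMC variant are easy to sketch: perturb the log-scores with Gumbel noise and sort. In the sketch below the uniforms feeding the Gumbel transform come from a scrambled Sobol' sequence; the paper's exact estimator construction may differ:
```python
# Sampling rankings from a Plackett-Luce model via the Gumbel top-k
# trick, with quasi-Monte Carlo (Sobol') uniforms replacing pseudo-random ones.
import numpy as np
from scipy.stats import qmc

log_scores = np.log(np.array([0.5, 0.3, 0.15, 0.05]))  # PL item scores
n_items, n_samples = len(log_scores), 256

# Sobol' sequence -> uniforms -> Gumbel noise: g = -log(-log(u)).
sobol = qmc.Sobol(d=n_items, scramble=True, seed=0)
u = sobol.random(n_samples)
gumbel = -np.log(-np.log(u))

# Sorting perturbed scores in decreasing order yields PL-distributed rankings.
rankings = np.argsort(-(log_scores + gumbel), axis=1)

# Any ranking metric averaged over these samples is a (Q)MC estimate
# of its expectation under the PL model.
prob_item0_first = (rankings[:, 0] == 0).mean()
print(prob_item0_first)  # should be close to 0.5
```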
    Multimodal Indoor Localisation for Measuring Mobility in Parkinson's Disease using Transformers. (arXiv:2205.06142v1 [cs.LG])
    Parkinson's disease (PD) is a slowly progressive debilitating neurodegenerative disease which is prominently characterised by motor symptoms. Indoor localisation, including number and speed of room to room transitions, provides a proxy outcome which represents mobility and could be used as a digital biomarker to quantify how mobility changes as this disease progresses. We use data collected from 10 people with Parkinson's, and 10 controls, each of whom lived for five days in a smart home with various sensors. In order to more effectively localise them indoors, we propose a transformer-based approach utilizing two data modalities, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices, which provide complementary views of movement. Our approach makes asymmetric and dynamic correlations by a) learning temporal correlations at different scales and levels, and b) utilizing various gating mechanisms to select relevant features within modality and suppress unnecessary modalities. On a dataset with real patients, we demonstrate that our proposed method gives an average accuracy of 89.9%, outperforming competitors. We also show that our model is able to better predict in-home mobility for people with Parkinson's with an average offset of 1.13 seconds to ground truth.  ( 2 min )
    Improved Meta Learning for Low Resource Speech Recognition. (arXiv:2205.06182v1 [cs.CL])
    We propose a new meta learning based framework for low resource speech recognition that improves the previous model agnostic meta learning (MAML) approach. MAML is a simple yet powerful meta learning approach. However, it presents some core deficiencies such as training instabilities and slower convergence speed. To address these issues, we adopt multi-step loss (MSL). MSL calculates losses at every step of the inner loop of MAML and then combines them with a weighted importance vector. The importance vector ensures that the loss at the last step has more importance than the previous steps. Our empirical evaluation shows that MSL significantly improves the stability of the training procedure and thus also improves the accuracy of the overall system. Our proposed system outperforms the MAML-based low resource ASR system on various languages in terms of character error rates and stable training behavior.  ( 2 min )
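    A minimal sketch of the multi-step loss on a toy regression task follows; the importance weights are illustrative, not those used in the paper:
```python
# Multi-step loss (MSL) sketch: the query loss after every inner-loop
# step is combined with an importance vector that up-weights the final step.
import torch

torch.manual_seed(0)
w = torch.zeros(1, requires_grad=True)            # meta-parameter
inner_lr, n_inner = 0.1, 3
msl_weights = [0.1, 0.2, 0.7]                     # illustrative importance vector

# One toy regression task with support and query sets.
x_s, x_q = torch.randn(16, 1), torch.randn(16, 1)
y_s, y_q = 3.0 * x_s, 3.0 * x_q

w_task, per_step_losses = w, []
for _ in range(n_inner):
    support_loss = ((x_s * w_task - y_s) ** 2).mean()
    (g,) = torch.autograd.grad(support_loss, w_task, create_graph=True)
    w_task = w_task - inner_lr * g                # differentiable inner update
    per_step_losses.append(((x_q * w_task - y_q) ** 2).mean())

# MSL: weighted sum of the query loss after every inner step,
# instead of only the loss after the final step as in vanilla MAML.
meta_loss = sum(wi * li for wi, li in zip(msl_weights, per_step_losses))
meta_loss.backward()                              # gradient w.r.t. meta-parameter w
print(w.grad)
```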
    SpecRepair: Counter-Example Guided Safety Repair of Deep Neural Networks. (arXiv:2106.01917v5 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are increasingly applied in safety-critical domains, such as self-driving cars, unmanned aircraft, and medical diagnosis. It is of fundamental importance to certify the safety of these DNNs, i.e. that they comply with a formal safety specification. While safety certification tools exactly answer this question, they are of no help in debugging unsafe DNNs, requiring the developer to iteratively verify and modify the DNN until safety is eventually achieved. Hence, a repair technique needs to be developed that can produce a safe DNN automatically. To address this need, we present SpecRepair, a tool that efficiently eliminates counter-examples from a DNN and produces a provably safe DNN without harming its classification accuracy. SpecRepair combines specification-based counter-example search with resumed training of the DNN that penalizes counter-examples, and then certifies the resulting DNN. We evaluate SpecRepair's effectiveness on the ACAS Xu benchmark, a DNN-based controller for unmanned aircraft, and two image classification benchmarks. The results show that SpecRepair is more successful in producing safe DNNs than comparable methods, has a shorter runtime, and produces safe DNNs while preserving their classification accuracy.  ( 2 min )
    Ensemble Classifier Design Tuned to Dataset Characteristics for Network Intrusion Detection. (arXiv:2205.06177v1 [cs.CR])
    Machine Learning-based supervised approaches require highly customized and fine-tuned methodologies to deliver outstanding performance. This paper presents a dataset-driven design and performance evaluation of a machine learning classifier for the network intrusion dataset UNSW-NB15. Analysis of the dataset suggests that it suffers from class representation imbalance and class overlap in the feature space. We employed ensemble methods using Balanced Bagging (BB), eXtreme Gradient Boosting (XGBoost), and Random Forest empowered by Hellinger Distance Decision Trees (RF-HDDT). BB and XGBoost are tuned to handle the imbalanced data, and the Random Forest (RF) classifier is supplemented by the Hellinger metric to address the imbalance issue. Two new algorithms are proposed to address the class overlap issue in the dataset. These two algorithms are leveraged to help improve the performance on the testing dataset by modifying the final classification decision made by the three base classifiers as part of the ensemble classifier, which employs a majority vote combiner. The proposed design is evaluated for both binary and multi-category classification. Comparing the proposed model to those reported on the same dataset in the literature demonstrates that the proposed model outperforms others by a significant margin for both binary and multi-category classification cases.
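    A hedged sketch of the three-classifier majority-vote ensemble is given below. Scikit-learn has no Hellinger Distance Decision Tree, so a plain Random Forest stands in for RF-HDDT, and synthetic imbalanced data stands in for UNSW-NB15:
```python
# Three base classifiers combined with a hard (majority) vote.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("bb", BalancedBaggingClassifier(random_state=0)),
        ("xgb", XGBClassifier(scale_pos_weight=9)),   # imbalance-aware boosting
        ("rf", RandomForestClassifier(random_state=0)),  # stand-in for RF-HDDT
    ],
    voting="hard",                                    # majority vote combiner
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```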
    Segmentation-Consistent Probabilistic Lesion Counting. (arXiv:2204.05276v2 [eess.IV] UPDATED)
    Lesion counts are important indicators of disease severity, patient prognosis, and treatment efficacy, yet counting as a task in medical imaging is often overlooked in favor of segmentation. This work introduces a novel continuously differentiable function that maps lesion segmentation predictions to lesion count probability distributions in a consistent manner. The proposed end-to-end approach--which consists of voxel clustering, lesion-level voxel probability aggregation, and Poisson-binomial counting--is non-parametric and thus offers a robust and consistent way to augment lesion segmentation models with post hoc counting capabilities. Experiments on Gadolinium-enhancing lesion counting demonstrate that our method outputs accurate and well-calibrated count distributions that capture meaningful uncertainty information. They also reveal that our model is suitable for multi-task learning of lesion segmentation, is efficient in low data regimes, and is robust to adversarial attacks.  ( 2 min )
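    The Poisson-binomial counting step can be sketched directly: given one existence probability per candidate lesion (in the paper these come from aggregating voxel probabilities), a short convolution yields the full count distribution:
```python
# Poisson-binomial count distribution from per-lesion probabilities.
import numpy as np

def poisson_binomial(probs):
    """P(count = k) for independent Bernoulli trials with given probs."""
    dist = np.array([1.0])                 # P(count = 0) with no lesions yet
    for p in probs:
        dist = np.convolve(dist, [1 - p, p])
    return dist

lesion_probs = [0.9, 0.7, 0.2]             # per-lesion-candidate probabilities
count_dist = poisson_binomial(lesion_probs)
print(count_dist)                          # P(0), P(1), P(2), P(3)
print(count_dist.argmax())                 # most likely lesion count
```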
    NER-MQMRC: Formulating Named Entity Recognition as Multi Question Machine Reading Comprehension. (arXiv:2205.05904v1 [cs.LG])
    NER has been traditionally formulated as a sequence labeling task. However, there has been a recent trend in posing NER as a machine reading comprehension task (Wang et al., 2020; Mengge et al., 2020), where the entity name (or other information) is considered as a question, the text as the context, and the entity value in the text as the answer snippet. These works consider MRC based on a single question (entity) at a time. We propose posing NER as a multi-question MRC task, where multiple questions (one question per entity) are considered at the same time for a single text. We propose a novel BERT-based multi-question MRC (NER-MQMRC) architecture for this formulation. The NER-MQMRC architecture considers all entities as input to BERT for learning token embeddings with self-attention and leverages BERT-based entity representations to further improve these token embeddings for the NER task. Evaluation on three NER datasets shows that our proposed architecture leads to, on average, 2.5 times faster training and 2.3 times faster inference compared to NER-SQMRC framework based models, by considering all entities together in a single pass. Further, we show that our model performance does not degrade compared to single-question based MRC (NER-SQMRC) (Devlin et al., 2019), leading to F1 gains of +0.41%, +0.32% and +0.27% for the AE-Pub, Ecommerce5PT and Twitter datasets respectively. We propose this architecture primarily to solve large scale e-commerce attribute (or entity) extraction from unstructured text, on the order of 50k+ attributes to be extracted, on a scalable production environment with high performance and optimised training and inference runtimes.  ( 2 min )
    Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements. (arXiv:2205.06129v1 [stat.ML])
    Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames -- especially those of minorities -- are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the naïve Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software package wru.  ( 2 min )
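    The underlying BISG Bayes update, and the zero-counts failure mode that fBISG addresses, can be illustrated with made-up numbers:
```python
# Toy BISG Bayes update (all numbers are fabricated for illustration):
# combine P(race | surname) from Census surname files with
# P(block | race) from block-level Census counts.
races = ["white", "black", "hispanic", "asian"]
p_race_given_surname = {"white": 0.60, "black": 0.25,
                        "hispanic": 0.10, "asian": 0.05}
# Residents of this Census block by race, divided by national totals
# per race, gives the geographic likelihood P(block | race).
p_block_given_race = {"white": 1e-6, "black": 4e-6,
                      "hispanic": 2e-6, "asian": 0.0}  # zero-count problem!

posterior = {r: p_race_given_surname[r] * p_block_given_race[r] for r in races}
z = sum(posterior.values())
posterior = {r: v / z for r, v in posterior.items()}
print(posterior)  # the zero count forces P(asian | data) = 0, which is
                  # exactly what fBISG's measurement-error model avoids
```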
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v1 [cs.LG])
    Recently, the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows that the negative term in the contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterize the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focusing only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 2 min )
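    The identity-initialized prediction head with trainable off-diagonal entries is straightforward to express; a minimal PyTorch sketch:
```python
# Prediction head initialized as the identity, with only the
# off-diagonal entries trainable, as described in the abstract.
import torch
import torch.nn as nn

class OffDiagonalHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.off_diag = nn.Parameter(torch.zeros(dim, dim))
        # Fixed (non-trainable) identity and off-diagonal mask.
        self.register_buffer("eye", torch.eye(dim))
        self.register_buffer("mask", 1.0 - torch.eye(dim))

    def forward(self, z):
        # Effective weight: identity plus trainable off-diagonal entries.
        weight = self.eye + self.off_diag * self.mask
        return z @ weight.T

head = OffDiagonalHead(4)
z = torch.randn(2, 4)
print(torch.allclose(head(z), z))  # True at initialization: head is identity
```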
    Gauge equivariant neural networks for quantum lattice gauge theories. (arXiv:2012.05232v2 [cond-mat.str-el] UPDATED)
    Gauge symmetries play a key role in physics appearing in areas such as quantum field theories of the fundamental particles and emergent degrees of freedom in quantum materials. Motivated by the desire to efficiently simulate many-body quantum systems with exact local gauge invariance, gauge equivariant neural-network quantum states are introduced, which exactly satisfy the local Hilbert space constraints necessary for the description of quantum lattice gauge theory with $\mathbb{Z}_d$ gauge group on different geometries. Focusing on the special case of the $\mathbb{Z}_2$ gauge group on a periodically identified square lattice, the equivariant architecture is analytically shown to contain the loop-gas solution as a special case. Gauge equivariant neural-network quantum states are used in combination with variational quantum Monte Carlo to obtain compact descriptions of the ground state wavefunction for the $\mathbb{Z}_2$ theory away from the exactly solvable limit, and to demonstrate the confining/deconfining phase transition of the Wilson loop order parameter.  ( 2 min )
    ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training. (arXiv:2205.06265v1 [cs.LG])
    Negative flips are errors introduced in a classification system when a legacy model is replaced with a new one. Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy using model distillation, or use ensembles, which multiply inference cost prohibitively. We present a method to train a classification system that achieves paragon performance in both error rate and NFR, at the inference cost of a single model. Our method introduces a generalized distillation objective, Logit Difference Inhibition (LDI), that penalizes changes in the logits between the new and old model, without forcing them to coincide as in ordinary distillation. LDI affords the model flexibility to reduce error rate along with NFR. The method uses a homogeneous ensemble as the reference model for LDI, hence the name Ensemble LDI, or ELODI. The reference model can then be substituted with a single model at inference time. The method leverages the observation that negative flips are typically not close to the decision boundary, but often exhibit large deviations in the distance among their logits, which are reduced by ELODI.  ( 2 min )
    Learning Generalized Policies Without Supervision Using GNNs. (arXiv:2205.06002v1 [cs.AI])
    We consider the problem of learning generalized policies for classical planning domains using graph neural networks from small instances represented in lifted STRIPS. The problem has been considered before but the proposed neural architectures are complex and the results are often mixed. In this work, we use a simple and general GNN architecture and aim at obtaining crisp experimental results and a deeper understanding: either the policy greedy in the learned value function achieves close to 100% generalization over instances larger than those used in training, or the failure must be understood, and possibly fixed, logically. For this, we exploit the relation established between the expressive power of GNNs and the $C_{2}$ fragment of first-order logic (namely, FOL with 2 variables and counting quantifiers). We find, for example, that domains with general policies that require more expressive features can be solved with GNNs once the states are extended with suitable "derived atoms" encoding role compositions and transitive closures that do not fit into $C_{2}$. The work follows the GNN approach for learning optimal general policies in a supervised fashion (Stahlberg, Bonet, Geffner, 2022); but the learned policies are no longer required to be optimal (which expands the scope, as many planning domains do not have general optimal policies) and are learned without supervision. Interestingly, value-based reinforcement learning methods that aim to produce optimal policies do not always yield policies that generalize, as the goals of optimality and generality are in conflict in domains where optimal planning is NP-hard.  ( 2 min )
    Learning more skills through optimistic exploration. (arXiv:2107.14226v6 [cs.LG] UPDATED)
    Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agent to explore and master the environment by encouraging each skill (latent) to reliably reach different states. However, an inherent exploration problem lingers: when a novel state is actually encountered, the discriminator will necessarily not have seen enough training data to produce accurate and confident skill classifications, leading to low intrinsic reward for the agent and effective penalization of the sort of exploration needed to actually maximize the objective. To combat this inherent pessimism towards exploration, we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement. Our objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples, thus providing an intrinsic reward more tailored to the true objective compared to pseudocount-based methods (Burda et al., 2019). We call this exploration bonus discriminator disagreement intrinsic reward, or DISDAIN. We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels). Thus, we encourage researchers to treat pessimism with DISDAIN.  ( 2 min )
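    A common way to estimate such ensemble disagreement is the entropy of the averaged prediction minus the average entropy of the individual predictions; the sketch below uses this form, though DISDAIN's exact bonus may be parameterized differently:
```python
# Disagreement bonus in the spirit of DISDAIN: information gain estimated
# from an ensemble of skill discriminators.
import torch

def disagreement_bonus(logits):
    """logits: (ensemble_size, batch, num_skills) discriminator outputs."""
    probs = logits.softmax(dim=-1)
    mean_probs = probs.mean(dim=0)
    entropy_of_mean = -(mean_probs * mean_probs.clamp_min(1e-8).log()).sum(-1)
    mean_of_entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean(0)
    return entropy_of_mean - mean_of_entropy   # >= 0; zero iff all agree

logits = torch.randn(5, 32, 8)                 # 5 discriminators, 8 skills
print(disagreement_bonus(logits).shape)        # per-state bonus, shape (32,)
```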
    AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. (arXiv:2104.02180v2 [cs.GR] UPDATED)
    Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, we propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning. High-level task objectives that the character should perform can be specified by relatively simple reward functions, while the low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips, without any explicit clip selection or sequencing. These motion clips are used to train an adversarial motion prior, which specifies style-rewards for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects which motion to perform, dynamically interpolating and generalizing from the dataset. Our system produces high-quality motions that are comparable to those achieved by state-of-the-art tracking-based techniques, while also being able to easily accommodate large datasets of unstructured motion clips. Composition of disparate skills emerges automatically from the motion prior, without requiring a high-level motion planner or other task-specific annotations of the motion clips. We demonstrate the effectiveness of our framework on a diverse cast of complex simulated characters and a challenging suite of motor control tasks.  ( 3 min )
    A Comparative Survey of Deep Active Learning. (arXiv:2203.13450v2 [cs.LG] UPDATED)
    Active Learning (AL) is a set of techniques for reducing labeling cost by sequentially selecting data samples from a large unlabeled data pool for labeling. Meanwhile, Deep Learning (DL) is data-hungry, and the performance of DL models scales monotonically with more training data. Therefore, in recent years, Deep Active Learning (DAL) has risen as a feasible solution for maximizing model performance while minimizing the expensive labeling cost. Abundant methods have sprung up and literature reviews of DAL have been presented before. However, the performance comparison of different branches of DAL methods under various tasks is still insufficient and our work fills this gap. In this paper, we survey and categorize DAL-related work and construct comparative experiments across frequently used datasets and DAL algorithms. Additionally, we explore some factors (e.g., batch size, number of epochs in the training process) that influence the efficacy of DAL, which provides better references for researchers to design their own DAL experiments or carry out DAL-related applications. We construct a DAL toolkit, DeepAL+, by re-implementing many highly-cited DAL-related methods, and it will be released to the public.  ( 2 min )
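    For readers unfamiliar with the basic loop being benchmarked, here is a generic pool-based active learning round with uncertainty sampling, sketched with a shallow scikit-learn model rather than a deep network:
```python
# Pool-based active learning with uncertainty sampling (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = list(range(10))                      # small initial labeled set
pool = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    model = LogisticRegression().fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])
    # Query the pool points the model is least confident about.
    uncertainty = 1.0 - probs.max(axis=1)
    queries = np.argsort(-uncertainty)[:20]    # batch size 20
    labeled += [pool[i] for i in queries]
    pool = [i for j, i in enumerate(pool) if j not in set(queries)]

print(len(labeled))  # 110 labels after 5 rounds
```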
    Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs. (arXiv:2205.05943v1 [cs.CL])
    We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables where one latent variable infers keys while another infers values. We run experiments on latent representations and experiments on syntax/semantics transfer which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.  ( 2 min )
    Exploiting symmetry in variational quantum machine learning. (arXiv:2205.06217v1 [quant-ph])
    Variational quantum machine learning is an extensively studied application of near-term quantum computers. The success of variational quantum learning models crucially depends on finding a suitable parametrization of the model that encodes an inductive bias relevant to the learning task. However, precious little is known about guiding principles for the construction of suitable parametrizations. In this work, we holistically explore when and how symmetries of the learning problem can be exploited to construct quantum learning models with outcomes invariant under the symmetry of the learning task. Building on tools from representation theory, we show how a standard gateset can be transformed into an equivariant gateset that respects the symmetries of the problem at hand through a process of gate symmetrization. We benchmark the proposed methods on two toy problems that feature a non-trivial symmetry and observe a substantial increase in generalization performance. As our tools can also be applied in a straightforward way to other variational problems with symmetric structure, we show how equivariant gatesets can be used in variational quantum eigensolvers.  ( 2 min )
    Localized Vision-Language Matching for Open-vocabulary Object Detection. (arXiv:2205.06160v1 [cs.CV])
    In this work, we propose an open-world object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-world detection approaches while being data-efficient.  ( 2 min )
    Equivariant quantum circuits for learning on weighted graphs. (arXiv:2205.06109v1 [quant-ph])
    Variational quantum algorithms are the leading candidate for near-term advantage on noisy quantum hardware. When training a parametrized quantum circuit to solve a specific task, the choice of ansatz is one of the most important factors that determines the trainability and performance of the algorithm. Problem-tailored ansatzes have become the standard for tasks in optimization or quantum chemistry, and yield more efficient algorithms with better performance than unstructured approaches. In quantum machine learning (QML), however, the literature on ansatzes that are motivated by the training data structure is scarce. Considering that it is widely known that unstructured ansatzes can become untrainable with increasing system size and circuit depth, it is of key importance to also study problem-tailored circuit architectures in a QML context. In this work, we introduce an ansatz for learning tasks on weighted graphs that respects an important graph symmetry, namely equivariance under node permutations. We evaluate the performance of this ansatz on a complex learning task on weighted graphs, where an ML model is used to implement a heuristic for a combinatorial optimization problem. We analytically study the expressivity of our ansatz at depth one, and numerically compare the performance of our model on instances with up to 20 qubits to ansatzes where the equivariance property is gradually broken. We show that our ansatz outperforms all others even in the small-instance regime. Our results strengthen the notion that symmetry-preserving ansatzes are a key to success in QML and should be an active area of research in order to enable near-term advantages in this field.  ( 2 min )
    Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets. (arXiv:2205.06218v1 [cs.CV])
    This paper performs comprehensive analysis on datasets for occlusion-aware face segmentation, a task that is crucial for many downstream applications. The collection and annotation of such datasets are time-consuming and labor-intensive. Although some efforts have been made in synthetic data generation, the naturalistic aspect of data remains less explored. In our study, we propose two occlusion generation techniques, Naturalistic Occlusion Generation (NatOcc), for producing high-quality naturalistic synthetic occluded faces; and Random Occlusion Generation (RandOcc), a more general synthetic occluded data generation method. We empirically show the effectiveness and robustness of both methods, even for unseen occlusions. To facilitate model evaluation, we present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild, featuring both careful alignment preprocessing and an in-the-wild setting for robustness test. We further conduct a comprehensive analysis on a newly introduced segmentation benchmark, offering insights for future exploration.  ( 2 min )
    E-Mail Assistant -- Automation of E-Mail Handling and Management using Robotic Process Automation. (arXiv:2205.05882v1 [cs.LG])
    In this paper, a workflow for designing a bot using Robotic Process Automation (RPA), associated with Artificial Intelligence (AI) used for information extraction, classification, etc., is proposed. The bot is equipped with many features that make email handling a stress-free job. It automatically logs in to the mailbox through secured channels, distinguishes between useful and non-useful emails, classifies the emails into different labels, downloads the attached files, creates different directories, and stores the downloaded files into the relevant directories. It moves the non-useful emails into the trash. Further, the bot can also be trained to rename the attached files with the names of the sender/applicant, in the case of a job application, for the sake of convenience. The bot is designed and tested using the UiPath tool to improve the performance of the system. The paper also discusses further possible functionalities that can be added to the bot.  ( 2 min )
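    The bot itself is built in UiPath; as a rough Python analogue of the same workflow (with placeholder host and credentials), one might write:
```python
# Sketch: log in over a secured channel, fetch unread mail, and file
# attachments into per-sender directories. Host/credentials are placeholders.
import imaplib
import pathlib
from email import message_from_bytes
from email.utils import parseaddr

HOST, USER, PASSWORD = "imap.example.com", "user@example.com", "secret"

with imaplib.IMAP4_SSL(HOST) as inbox:              # secured channel
    inbox.login(USER, PASSWORD)
    inbox.select("INBOX")
    _, data = inbox.search(None, "UNSEEN")          # unread emails only
    for num in data[0].split():
        _, msg_data = inbox.fetch(num, "(RFC822)")
        msg = message_from_bytes(msg_data[0][1])
        sender = parseaddr(msg["From"])[1]
        for part in msg.walk():                     # download attachments
            filename = part.get_filename()
            if filename:
                folder = pathlib.Path("attachments") / sender
                folder.mkdir(parents=True, exist_ok=True)
                (folder / filename).write_bytes(part.get_payload(decode=True))
```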
    Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio. (arXiv:2205.05871v1 [cs.SD])
    Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describes an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentanglement of the underlying local and global factors. In this paper, we show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables, and is prone to collapse the static latent variable. As a countermeasure, we propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions, which are subsequently employed to regularise the model and facilitate auxiliary objectives to promote disentanglement. The proposed framework is fully unsupervised and robust against the global factor collapse problem across a wide range of model configurations. It also avoids typical solutions such as adversarial training which usually involves laborious parameter tuning, and domain-specific data augmentation. We conduct quantitative and qualitative evaluations to demonstrate its robustness in terms of disentanglement on both artificial and real-world music audio datasets.  ( 2 min )
    Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. (arXiv:2205.06053v1 [cs.SD])
    This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism. In our previous work, we proposed unified Source-Filter GAN (uSFGAN) for developing a high-fidelity neural vocoder with flexible voice controllability using a unified source-filter neural network architecture. However, the capability of uSFGAN to model the aperiodic source excitation signal is insufficient, and there is still a gap in sound quality between the natural and generated speech. To improve the source excitation modeling and generated sound quality, a new source excitation generation network separately generating periodic and aperiodic components is proposed. The advanced adversarial training procedure of HiFiGAN is also adopted to replace that of Parallel WaveGAN used in the original uSFGAN. Both objective and subjective evaluation results show that the modified uSFGAN significantly improves the sound quality of the basic uSFGAN while maintaining the voice controllability.  ( 2 min )
    Optimal transport weights for causal inference. (arXiv:2109.01991v4 [stat.ME] UPDATED)
    Imbalance in covariate distributions leads to biased estimates of causal effects. Weighting methods attempt to correct this imbalance but rely on specifying models for the treatment assignment mechanism, which is unknown in observational studies. This leaves researchers to choose the proper weighting method and the appropriate covariate functions for these models without knowing the correct combination to achieve distributional balance. In response to these difficulties, we propose a nonparametric generalization of several other weighting schemes found in the literature: Causal Optimal Transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between any source and target population. Our approach is semiparametrically efficient and model-free but can also incorporate moments or any other important functions of covariates that a researcher desires to balance. Moreover, our method can provide nonparametric estimates of the conditional mean outcome function, and hence nonparametric imputations of the missing potential outcomes, and we give rates of convergence for this estimator. We find that Causal Optimal Transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control trial examining the effect of misoprostol versus oxytocin for the treatment of post-partum hemorrhage.  ( 2 min )
    Communicative Subgraph Representation Learning for Multi-Relational Inductive Drug-Gene Interaction Prediction. (arXiv:2205.05957v1 [cs.LG])
    Illuminating the interconnections between drugs and genes is an important topic in drug development and precision medicine. Currently, computational predictions of drug-gene interactions mainly focus on the binding interactions without considering other relation types like agonist, antagonist, etc. In addition, existing methods either heavily rely on high-quality domain features or are intrinsically transductive, which limits the capacity of models to generalize to drugs/genes that lack external information or are unseen during the training process. To address these problems, we propose a novel Communicative Subgraph representation learning for Multi-relational Inductive drug-Gene interactions prediction (CoSMIG), where the predictions of drug-gene relations are made through subgraph patterns, and thus are naturally inductive for unseen drugs/genes without retraining or utilizing external domain features. Moreover, the model strengthens the relations on the drug-gene graph through a communicative message passing mechanism. To evaluate our method, we compiled two new benchmark datasets from DrugBank and DGIdb. The comprehensive experiments on the two datasets showed that our method outperformed state-of-the-art baselines in the transductive scenarios and achieved superior performance in the inductive ones. Further experimental analysis including LINCS experimental validation and literature verification also demonstrated the value of our model.
    Embodied vision for learning object representations. (arXiv:2205.06198v1 [cs.LG])
    Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.
    Smooth-Reduce: Leveraging Patches for Improved Certified Robustness. (arXiv:2205.06154v1 [cs.LG])
    Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved classifier certificates. Our algorithm classifies overlapping patches extracted from an input image, and aggregates the predicted logits to certify a larger radius around the input. We study two aggregation schemes -- max and mean -- and show that both approaches provide better certificates in terms of certified accuracy, average certified radii and abstention rates as compared to concurrent approaches. We also provide theoretical guarantees for such certificates, and empirically show significant improvements over other randomized smoothing methods that require expensive retraining. Further, we extend our approach to videos and provide meaningful certificates for video classifiers. A project page can be found at https://nyu-dice-lab.github.io/SmoothReduce/  ( 2 min )
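    The patch-and-aggregate step can be sketched as follows, with an untrained stand-in classifier; Smooth-Reduce additionally converts the aggregated predictions into certified radii, which is omitted here:
```python
# Classify overlapping patches of the input and aggregate their logits
# by mean or max, as in the Smooth-Reduce aggregation step.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in
image = torch.randn(3, 64, 64)

# Extract overlapping 32x32 patches with stride 16 via unfold.
patches = image.unfold(1, 32, 16).unfold(2, 32, 16)      # (3, 3, 3, 32, 32)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, 32, 32)

with torch.no_grad():
    logits = model(patches)                               # (9, 10)

mean_logits = logits.mean(dim=0)                          # mean aggregation
max_logits = logits.max(dim=0).values                     # max aggregation
print(mean_logits.argmax(), max_logits.argmax())
```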
    GPN: A Joint Structural Learning Framework for Graph Neural Networks. (arXiv:2205.05964v1 [cs.LG])
    Graph neural networks (GNNs) have been applied to a variety of graph tasks. Most existing work on GNNs is based on the assumption that the given graph data is optimal, while in practice it is inevitable that there exist missing or incomplete edges in the graph data used for training, leading to degraded performance. In this paper, we propose Generative Predictive Network (GPN), a GNN-based joint learning framework that simultaneously learns the graph structure and the downstream task. Specifically, we develop a bilevel optimization framework for this joint learning task, in which the upper optimization (generator) and the lower optimization (predictor) are both instantiated with GNNs. To the best of our knowledge, our method is the first GNN-based bilevel optimization framework for resolving this task. Through extensive experiments, our method outperforms a wide range of baselines using benchmark datasets.
    Graph Masked Autoencoders with Transformers. (arXiv:2202.08391v2 [cs.LG] UPDATED)
    Recently, transformers have shown promising performance in learning graph representations. However, there are still some challenges when applying transformers to real-world scenarios due to the fact that deep transformers are hard to train from scratch and the quadratic memory consumption w.r.t. the number of nodes. In this paper, we propose Graph Masked Autoencoders (GMAEs), a self-supervised transformer-based model for learning graph representations. To address the above two challenges, we adopt the masking mechanism and the asymmetric encoder-decoder design. Specifically, GMAE takes partially masked graphs as input, and reconstructs the features of the masked nodes. The encoder and decoder are asymmetric, where the encoder is a deep transformer and the decoder is a shallow transformer. The masking mechanism and the asymmetric design make GMAE a memory-efficient model compared with conventional transformers. We show that, when serving as a conventional self-supervised graph representation model, GMAE achieves state-of-the-art performance on both the graph classification task and the node classification task under common downstream evaluation protocols. We also show that, compared with training in an end-to-end manner from scratch, we can achieve comparable performance after pre-training and fine-tuning using GMAE while simplifying the training process.
    A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models. (arXiv:2104.02640v2 [math.ST] UPDATED)
    Mixture of experts (MoE) are a popular class of statistical and machine learning models that have gained attention over the years due to their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to present nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models pose difficult statistical estimation and model selection questions, both from a computational and theoretical perspective. This paper is devoted to the study of the problem of model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds that take the form of weak oracle inequalities, provided that lower bounds for the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.
    Robustness and Reliability When Training With Noisy Labels. (arXiv:2110.03321v2 [stat.ML] UPDATED)
    Labelling of data for supervised learning can be costly and time-consuming and the risk of incorporating label noise in large data sets is imminent. When training a flexible discriminative model using a strictly proper loss, such noise will inevitably shift the solution towards the conditional distribution over noisy labels. Nevertheless, while deep neural networks have proven capable of fitting random labels, regularisation and the use of robust loss functions empirically mitigate the effects of label noise. However, such observations concern robustness in accuracy, which is insufficient if reliable uncertainty quantification is critical. We demonstrate this by analysing the properties of the conditional distribution over noisy labels for an input-dependent noise model. In addition, we evaluate the set of robust loss functions characterised by noise-insensitive, asymptotic risk minimisers. We find that strictly proper and robust loss functions both offer asymptotic robustness in accuracy, but neither guarantee that the final model is calibrated. Moreover, even with robust loss functions, overfitting is an issue in practice. With these results, we aim to explain observed robustness of common training practices, such as early stopping, to label noise. In addition, we aim to encourage the development of new noise-robust algorithms that not only preserve accuracy but that also ensure reliability.
    A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation. (arXiv:2203.09893v2 [cs.SD] UPDATED)
    Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g. instrument-specific systems tend to yield improved results over instrument-agnostic methods. Similarly, higher accuracy can be obtained when only estimating frame-wise $f_0$ values and neglecting the harder note event detection. Despite their high accuracy, such specialized systems often cannot be deployed in the real-world. Storage and network constraints prohibit the use of multiple specialized models, while memory and run-time constraints limit their complexity. In this paper, we propose a lightweight neural network for musical instrument transcription, which supports polyphonic outputs and generalizes to a wide variety of instruments (including vocals). Our model is trained to jointly predict frame-wise onsets, multipitch and note activations, and we experimentally show that this multi-output structure improves the resulting frame-level note accuracy. Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below those of specialized state-of-the-art AMT systems. With this work we hope to encourage the community to further investigate low-resource, instrument-agnostic AMT systems.
    Virtual twins of nonlinear vibrating multiphysics microstructures: physics-based versus deep learning-based approaches. (arXiv:2205.05928v1 [math.DS])
    Micro-Electro-Mechanical-Systems are complex structures, often involving nonlinearities of geometric and multiphysics nature, that are used as sensors and actuators in countless applications. Starting from full-order representations, we apply deep learning techniques to generate accurate, efficient and real-time reduced order models to be used as virtual twins for the simulation and optimization of higher-level complex systems. We extensively test the reliability of the proposed procedures on micromirrors, arches and gyroscopes, also displaying intricate dynamical evolutions like internal resonances. In particular, we discuss the accuracy of the deep learning technique and its ability to replicate and converge to the invariant manifolds predicted using the recently developed direct parametrization approach that allows extracting the nonlinear normal modes of large finite element models. Finally, by addressing an electromechanical gyroscope, we show that the non-intrusive deep learning approach generalizes easily to complex multiphysics problems.
    Pseudo-Label Guided Multi-Contrast Generalization for Non-Contrast Organ-Aware Segmentation. (arXiv:2205.05898v1 [eess.IV])
    Non-contrast computed tomography (NCCT) is commonly acquired for lung cancer screening, assessment of general abdominal pain or suspected renal stones, trauma evaluation, and many other indications. However, the absence of contrast limits the ability to distinguish boundaries between organs. In this paper, we propose a novel unsupervised approach that leverages pairwise contrast-enhanced CT (CECT) context to compute non-contrast segmentation without ground-truth labels. Unlike generative adversarial approaches, we compute the pairwise morphological context with CECT to provide teacher guidance instead of generating fake anatomical context. Additionally, we further augment the intensity correlations in 'organ-specific' settings and increase the sensitivity to organ-aware boundaries. We validate our approach on multi-organ segmentation with paired non-contrast & contrast-enhanced CT scans using five-fold cross-validation. Full external validations are performed on an independent non-contrast cohort for aorta segmentation. Compared with the current abdominal organ segmentation state-of-the-art in the fully supervised setting, our proposed pipeline achieves a significantly higher Dice score, by 3.98% (internal, multi-organ annotated) and 8.00% (external, aorta annotated), for abdominal organ segmentation. The code and pretrained models are publicly available at https://github.com/MASILab/ContrastMix.  ( 2 min )
    SIBILA: High-performance computing and interpretable machine learning join efforts toward personalised medicine in a novel decision-making tool. (arXiv:2205.06234v1 [cs.LG])
    Background and Objectives: Personalised medicine remains a major challenge for scientists. The rapid growth of Machine learning and Deep learning has made them a feasible alternative for predicting the most appropriate therapy for individual patients. However, the lack of interpretation of their results and high computational requirements make many reluctant to use these methods. Methods: Several Machine learning and Deep learning models have been implemented into a single software tool, SIBILA. Once the models are trained, SIBILA applies a range of interpretability methods to identify the input features that each model considered most important for its predictions. In addition, all the obtained features are pooled to estimate the global attribution of each variable to the predictions. To facilitate its use by non-experts, SIBILA is also available to all users free of charge as a web server at https://bio-hpc.ucam.edu/sibila/. Results: SIBILA has been applied to three case studies to show its accuracy and efficiency in classification and regression problems. The first two cases proved that SIBILA can make accurate predictions even on uncleaned datasets. The last case demonstrates that SIBILA can be applied to medical contexts with real data. Conclusion: With the aim of becoming a powerful decision-making tool for clinicians, SIBILA has been developed. SIBILA is a novel software tool that leverages interpretable machine learning to make accurate predictions and explain how models made those decisions. SIBILA can be run on high-performance computing platforms, drastically reducing computing times.  ( 2 min )
    Combining Learning from Human Feedback and Knowledge Engineering to Solve Hierarchical Tasks in Minecraft. (arXiv:2112.03482v2 [cs.LG] UPDATED)
    Real-world tasks of interest are generally poorly defined by human-readable descriptions and have no pre-defined reward signals unless it is defined by a human designer. Conversely, data-driven algorithms are often designed to solve a specific, narrowly defined, task with performance metrics that drives the agent's learning. In this work, we present the solution that won first place and was awarded the most human-like agent in the 2021 NeurIPS Competition MineRL BASALT Challenge: Learning from Human Feedback in Minecraft, which challenged participants to use human data to solve four tasks defined only by a natural language description and no reward function. Our approach uses the available human demonstration data to train an imitation learning policy for navigation and additional human feedback to train an image classifier. These modules, combined with an estimated odometry map, become a powerful state-machine designed to utilize human knowledge in a natural hierarchical paradigm. We compare this hybrid intelligence approach to both end-to-end machine learning and pure engineered solutions, which are then judged by human evaluators. Codebase is available at https://github.com/viniciusguigo/kairos_minerl_basalt.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in mixture-of-experts regression models. (arXiv:2009.10622v3 [math.ST] UPDATED)
    Mixture-of-experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.
    Social learning via actions in bandit environments. (arXiv:2205.06107v1 [econ.TH])
    I study a game of strategic exploration with private payoffs and public actions in a Bayesian bandit setting. In particular, I look at cascade equilibria, in which agents switch over time from the risky action to the riskless action only when they become sufficiently pessimistic. I show that these equilibria exist under some conditions and establish their salient properties. Individual exploration in these equilibria can be more or less than the single-agent level depending on whether the agents start out with a common prior or not, but the most optimistic agent always underexplores. I also show that allowing the agents to write enforceable ex-ante contracts leads the most ex-ante optimistic agent to buy all payoff streams, providing an explanation for the buyout of smaller start-ups by more established firms.  ( 2 min )
    The Implicit Bias of Benign Overfitting. (arXiv:2201.11489v3 [cs.LG] UPDATED)
    The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally not occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted squared hinge loss. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.
    Healthy Twitter discussions? Time will tell. (arXiv:2203.11261v2 [cs.SI] UPDATED)
    Studying misinformation and how to deal with unhealthy behaviours within online discussions has recently become an important field of research within social studies. With the rapid development of social media, and the increasing amount of available information and sources, rigorous manual analysis of such discourses has become unfeasible. Many approaches tackle the issue by studying the semantic and syntactic properties of discussions following a supervised approach, for example using natural language processing on a dataset labeled for abusive, fake or bot-generated content. Solutions that depend on the existence of a ground truth are limited to domains where such a ground truth is available. However, within the context of misinformation, it may be difficult or even impossible to assign labels to instances. In this context, we consider the use of temporal dynamic patterns as an indicator of discussion health. Working in a domain for which ground truth was unavailable at the time (early COVID-19 pandemic discussions), we explore the characterization of discussions based on the volume and timing of contributions. First we explore the types of discussions in an unsupervised manner, and then characterize these types using the concept of ephemerality, which we formalize. In the end, we discuss the potential use of our ephemerality definition for labeling online discourses based on how desirable, healthy and constructive they are.
    Neural Network-based OFDM Receiver for Resource Constrained IoT Devices. (arXiv:2205.06159v1 [eess.SP])
    Orthogonal Frequency Division Multiplexing (OFDM)-based waveforms are used for communication links in many current and emerging Internet of Things (IoT) applications, including the latest WiFi standards. For such OFDM-based transceivers, many core physical layer functions related to channel estimation, demapping, and decoding are implemented for specific choices of channel types and modulation schemes, among others. To decouple hard-wired choices from the receiver chain and thereby enhance the flexibility of IoT deployment in many novel scenarios without changing the underlying hardware, we explore a novel, modular Machine Learning (ML)-based receiver chain design. Here, ML blocks replace the individual processing blocks of an OFDM receiver, and we specifically describe replacing the legacy channel estimation, symbol demapping, and decoding blocks with Neural Networks (NNs). A unique aspect of this modular design is providing flexible allocation of processing functions to the legacy or ML blocks, allowing them to interchangeably coexist. Furthermore, we study the implementation cost-benefits of the proposed NNs in resource-constrained IoT devices through pruning and quantization, as well as emulation of these compressed NNs within Field Programmable Gate Arrays (FPGAs). Our evaluations demonstrate that the proposed modular NN-based receiver improves the bit error rate of the traditional non-ML receiver by an average of 61% and 10% for the simulated and over-the-air datasets, respectively. We further show complexity-performance tradeoffs by presenting computational complexity comparisons between the traditional algorithms and the proposed compressed NNs.
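    To make the modular block-swapping concrete, below is a minimal PyTorch-style sketch of the kind of neural symbol demapper that could stand in for the legacy demapping block. All names, layer sizes, and the QPSK assumption are illustrative, not the paper's actual architecture.

        import torch
        import torch.nn as nn

        class NeuralDemapper(nn.Module):
            """Map equalized complex symbols to per-bit soft values (LLR-like)."""
            def __init__(self, bits_per_symbol: int = 2, hidden: int = 32):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(2, hidden), nn.ReLU(),      # input: (Re, Im)
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, bits_per_symbol),   # one soft output per bit
                )

            def forward(self, symbols: torch.Tensor) -> torch.Tensor:
                x = torch.stack([symbols.real, symbols.imag], dim=-1)
                return self.net(x)  # feed to a (legacy or neural) decoder

        # Train with BCEWithLogitsLoss against the transmitted bits, then
        # prune/quantize the weights before mapping the model onto an FPGA.
        demapper = NeuralDemapper()
        rx = torch.randn(8, dtype=torch.cfloat)   # 8 noisy QPSK symbols
        print(demapper(rx).shape)                 # torch.Size([8, 2])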
    Improved Sample Complexity Bounds for Branch-and-Cut. (arXiv:2111.11207v2 [cs.LG] UPDATED)
    Branch-and-cut is the most widely used algorithm for solving integer programs, employed by commercial solvers like CPLEX and Gurobi. Branch-and-cut has a wide variety of tunable parameters that have a huge impact on the size of the search tree that it builds, but are challenging to tune by hand. An increasingly popular approach is to use machine learning to tune these parameters: using a training set of integer programs from the application domain at hand, the goal is to find a configuration with strong predicted performance on future, unseen integer programs from the same domain. If the training set is too small, a configuration may have good performance over the training set but poor performance on future integer programs. In this paper, we prove sample complexity guarantees for this procedure, which bound how large the training set should be to ensure that for any configuration, its average performance over the training set is close to its expected future performance. Our guarantees apply to parameters that control the most important aspects of branch-and-cut: node selection, branching constraint selection, and cutting plane selection, and are sharper and more general than those found in prior research.
    AiSocrates: Towards Answering Ethical Quandary Questions. (arXiv:2205.05989v1 [cs.CL])
    Considerable advancements have been made in various NLP tasks based on the impressive power of large pre-trained language models (LLMs). These results have inspired efforts to understand the limits of LLMs so as to evaluate how far we are from achieving human-level general natural language understanding. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist for a single quandary. We propose a system, AiSocrates, that provides an answer to an ethical quandary through a deliberative exchange of different perspectives, in the spirit of Socratic philosophy, instead of providing a closed answer like an oracle. AiSocrates searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also address safety concerns by providing a human controllability option in choosing ethical principles. We show that AiSocrates generates promising answers to ethical quandary questions with multiple perspectives, by one measure doing so 6.92% more often than answers written by human philosophers, but the system still needs improvement to fully match the coherence of human philosophers. We argue that AiSocrates is a promising step toward developing an NLP system that incorporates human values explicitly through prompt instructions. We are releasing the code for research purposes.
    Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. (arXiv:2104.13921v3 [cs.CV] UPDATED)
    We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
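    The distillation recipe described above can be sketched in a few lines: the student's region embeddings are pulled toward the teacher's image embeddings of cropped proposals, while classification is done by cosine similarity against the teacher's text embeddings. The shapes, names, and the plain L1/cross-entropy pairing are a simplified reading of the abstract, not the released implementation.

        import torch
        import torch.nn.functional as F

        def vild_losses(student_region_emb, teacher_image_emb,
                        text_emb, labels, temperature=0.01):
            """Sketch of the two ViLD-style objectives (shapes illustrative).

            student_region_emb: (N, D) region embeddings from the detector
            teacher_image_emb:  (N, D) teacher embeddings of cropped proposals
            text_emb:           (C, D) teacher text embeddings of category names
            labels:             (N,)   base-category targets for the proposals
            """
            # Distillation: pull student region embeddings toward the teacher's
            distill = F.l1_loss(student_region_emb, teacher_image_emb)
            # Classification: score regions against text embeddings by cosine sim
            logits = (F.normalize(student_region_emb, dim=-1)
                      @ F.normalize(text_emb, dim=-1).T) / temperature
            cls = F.cross_entropy(logits, labels)
            return distill, cls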
    Fair NLP Models with Differentially Private Text Encoders. (arXiv:2205.06135v1 [cs.CL])
    Encoded text representations often capture sensitive attributes about individuals (e.g., race or gender), which raise privacy concerns and can make downstream models unfair to certain groups. In this work, we propose FEDERATE, an approach that combines ideas from differential privacy and adversarial training to learn private text representations which also induces fairer models. We empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on four NLP datasets. Our results show that FEDERATE consistently improves upon previous methods, and thus suggest that privacy and fairness can positively reinforce each other.  ( 2 min )
    Energy-Based Learning for Cooperative Games, with Applications to Valuation Problems in Machine Learning. (arXiv:2106.02938v4 [cs.LG] UPDATED)
    Valuation problems, such as feature interpretation, data valuation and model valuation for ensembles, become increasingly more important in many machine learning applications. Such problems are commonly solved by well-known game-theoretic criteria, such as Shapley value or Banzhaf value. In this work, we present a novel energy-based treatment for cooperative games, with a theoretical justification by the maximum entropy framework. Surprisingly, by conducting variational inference of the energy-based model, we recover various game-theoretic valuation criteria through conducting one-step fixed point iteration for maximizing the mean-field ELBO objective. This observation also verifies the rationality of existing criteria, as they are all attempting to decouple the correlations among the players through the mean-field approach. By running fixed point iteration for multiple steps, we achieve a trajectory of the valuations, among which we define the valuation with the best conceivable decoupling error as the Variational Index. We prove that under uniform initializations, these variational valuations all satisfy a set of game-theoretic axioms. We experimentally demonstrate that the proposed Variational Index enjoys lower decoupling error and better valuation performance on certain synthetic and real-world valuation problems.  ( 2 min )
    Generating Fair Universal Representations using Adversarial Models. (arXiv:1910.00411v7 [cs.LG] UPDATED)
    We present a data-driven framework for learning fair universal representations (FUR) that guarantee statistical fairness for any learning task that may not be known a priori. Our framework leverages recent advances in adversarial learning to allow a data holder to learn representations in which a set of sensitive attributes are decoupled from the rest of the dataset. We formulate this as a constrained minimax game between an encoder and an adversary where the constraint ensures a measure of usefulness (utility) of the representation. The resulting problem is that of censoring, i.e., finding a representation that is least informative about the sensitive attributes given a utility constraint. For appropriately chosen adversarial loss functions, our censoring framework precisely clarifies the optimal adversarial strategy against strong information-theoretic adversaries; it also achieves the fairness measure of demographic parity for the resulting constrained representations. We evaluate the performance of our proposed framework on both synthetic and publicly available datasets. For these datasets, we use two tradeoff measures: censoring vs. representation fidelity and fairness vs. utility for downstream tasks, to amply demonstrate that multiple sensitive features can be effectively censored even as the resulting fair representations ensure accuracy for multiple downstream tasks.  ( 2 min )
    So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements. (arXiv:2109.01226v2 [cs.CL] UPDATED)
    More predictable words are easier to process - they are read faster and elicit smaller neural signals associated with processing difficulty, most notably, the N400 component of the event-related brain potential. Thus, it has been argued that prediction of upcoming words is a key component of language comprehension, and that studying the amplitude of the N400 is a valuable way to investigate the predictions we make. In this study, we investigate whether the linguistic predictions of computational language models or humans better reflect the way in which natural language stimuli modulate the amplitude of the N400. One important difference in the linguistic predictions of humans versus computational language models is that while language models base their predictions exclusively on the preceding linguistic context, humans may rely on other factors. We find that the predictions of three top-of-the-line contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more closely than human predictions. This suggests that the predictive processes underlying the N400 may be more sensitive to the surface-level statistics of language than previously thought.  ( 2 min )
    Efficient Federated Learning for AIoT Applications Using Knowledge Distillation. (arXiv:2111.14347v2 [cs.LG] UPDATED)
    As a promising distributed machine learning paradigm, Federated Learning (FL) trains a central model with decentralized data without compromising user privacy, which has made it widely used by Artificial Intelligence Internet of Things (AIoT) applications. However, traditional FL suffers from model inaccuracy, since it trains local models using hard labels of data and ignores the useful information carried by incorrect predictions with small probabilities. Although various solutions try to tackle the bottleneck of traditional FL, most of them introduce significant communication and memory overhead, making the deployment of large-scale AIoT devices a great challenge. To address the above problem, this paper presents a novel Distillation-based Federated Learning (DFL) architecture that enables efficient and accurate FL for AIoT applications. Inspired by Knowledge Distillation (KD), which can increase model accuracy, our approach adds the soft targets used by KD to the FL model training, which occupies negligible network resources. The soft targets are generated by local sample predictions of each AIoT device after each round of local training and used for the next round of model training. During the local training of DFL, both soft targets and hard labels are used as approximation objectives of model predictions to improve model accuracy by supplementing the knowledge of soft targets. To further improve the performance of our DFL model, we design a dynamic adjustment strategy for tuning the ratio of the two loss functions used in KD, which can maximize the use of both soft targets and hard labels. Comprehensive experimental results on well-known benchmarks show that our approach can significantly improve the model accuracy of FL with both Independent and Identically Distributed (IID) and non-IID data.  ( 2 min )
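    A minimal sketch of the local training objective as described, assuming a standard knowledge-distillation formulation: cross-entropy on hard labels plus a temperature-scaled KL term on the device's own soft targets from the previous round, mixed by a ratio alpha that the paper's dynamic strategy would tune. The exact loss form and the temperature value are assumptions.

        import torch
        import torch.nn.functional as F

        def dfl_local_loss(logits, hard_labels, soft_targets, alpha, T=2.0):
            """Local objective sketch: CE on hard labels plus a KD term on
            soft targets, mixed by a (dynamically tuned) ratio alpha.

            soft_targets are the device's own predictions from the previous
            round; alpha would be adjusted over rounds by the dynamic strategy.
            """
            ce = F.cross_entropy(logits, hard_labels)
            kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                          F.softmax(soft_targets / T, dim=-1),
                          reduction="batchmean") * T * T
            return alpha * kd + (1.0 - alpha) * ce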
    Training Uncertainty-Aware Classifiers with Conformalized Deep Learning. (arXiv:2205.05878v1 [stat.ML])
    Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We address this problem by developing a novel training algorithm that can lead to more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method leads to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.  ( 2 min )
    Performing Video Frame Prediction of Microbial Growth with a Recurrent Neural Network. (arXiv:2205.05810v1 [cs.LG])
    A Recurrent Neural Network (RNN) was used to perform video frame prediction of microbial growth for a population of two mutants of Pseudomonas aeruginosa. The RNN was trained on videos of 20 frames that were acquired using fluorescence microscopy and microfluidics. The network predicted the last 10 frames of each video, and the accuracy of the predictions was assessed by comparing raw images, population curves, and the number and size of individual colonies. Overall, we found the predictions to be accurate using this approach. The implications this result has on designing autonomous experiments in microbiology, and the steps that can be taken to make the predictions even more accurate, are discussed.  ( 2 min )
    Subgroup discovery of Parkinson's Disease by utilizing a multi-modal smart device system. (arXiv:2205.05961v1 [cs.LG])
    In recent years, sensors from smart consumer devices have shown great diagnostic potential in movement disorders. In this context, data modalities such as electronic questionnaires, hand movement and voice captures have successfully captured biomarkers and allowed discrimination between Parkinson's disease (PD) and healthy controls (HC) or differential diagnosis (DD). However, to the best of our knowledge, a comprehensive evaluation of assessments with a multi-modal smart device system has still been lacking. In a prospective study exploring PD, we used smartwatches and smartphones to collect multi-modal data from 504 participants, including PD patients, DD and HC. This study aims to assess the effect of multi-modal vs. single-modal data on PD vs. HC and PD vs. DD classification, as well as on PD group clustering for subgroup identification. We were able to show that by combining various modalities, classification accuracy improved and further PD clusters were discovered.  ( 2 min )
    Topologically-Aware Deformation Fields for Single-View 3D Reconstruction. (arXiv:2205.06267v1 [cs.CV])
    We present a new framework for learning 3D object shapes and dense cross-object 3D correspondences from just an unaligned category-specific image collection. The 3D shapes are generated implicitly as deformations to a category-specific signed distance field and are learned in an unsupervised manner solely from unaligned image collections without any 3D supervision. Generally, image collections on the internet contain several intra-category geometric and topological variations, for example, different chairs can have different topologies, which makes the task of joint shape and correspondence estimation much more challenging. Because of this, prior works either focus on learning each 3D object shape individually without modeling cross-instance correspondences or perform joint shape and correspondence estimation on categories with minimal intra-category topological variations. We overcome these restrictions by learning a topologically-aware implicit deformation field that maps a 3D point in the object space to a higher dimensional point in the category-specific canonical space. At inference time, given a single image, we reconstruct the underlying 3D shape by first implicitly deforming each 3D point in the object space to the learned category-specific canonical space using the topologically-aware deformation field and then reconstructing the 3D shape as a canonical signed distance field. Both canonical shape and deformation field are learned end-to-end in an inverse-graphics fashion using a learned recurrent ray marcher (SRN) as a differentiable rendering module. Our approach, dubbed TARS, achieves state-of-the-art reconstruction fidelity on several datasets: ShapeNet, Pascal3D+, CUB, and Pix3D chairs. Result videos and code at https://shivamduggal4.github.io/tars-3D/  ( 2 min )
    Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers. (arXiv:2205.05055v2 [cs.AI] CROSS LISTED)
    Large transformer-based language models are able to perform few-shot learning (also known as in-context learning), without having been explicitly trained for it. We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, as these characteristics might lead to a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We also hypothesized that these distributional properties could lead to emergent few-shot learning in domains outside of language. Inspired by this idea, we ran a series of experiments on a standard image-based few-shot dataset. We discovered that a number of data properties did indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language -- burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influenced whether models were biased towards either few-shot learning vs. memorizing information in their weights; models could generally perform well at only one or the other. However, we discovered that an additional distributional property could allow the two capabilities to co-exist in the same model -- a skewed, Zipfian distribution over classes -- which occurs in language as well. Notably, training data that could elicit few-shot learning in transformers were unable to elicit few-shot learning in recurrent models. In sum, we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.  ( 2 min )
    Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are?. (arXiv:2201.06251v2 [eess.IV] UPDATED)
    Cancer is one of the leading causes of death worldwide, and head and neck (H&N) cancer is amongst the most prevalent types. Positron emission tomography and computed tomography are used to detect, segment and quantify the tumor region. Clinically, tumor segmentation is extensively time-consuming and prone to error. Machine learning, and deep learning in particular, can assist to automate this process, yielding results as accurate as the results of a clinician. In this paper, we investigate a vision transformer-based method to automatically delineate H&N tumor, and compare its results to leading convolutional neural network (CNN)-based models. We use multi-modal data from CT and PET scans to perform the segmentation task. We show that a solution with a transformer-based model has the potential to achieve comparable results to CNN-based ones. With cross validation, the model achieves a mean dice similarity coefficient (DSC) of 0.736, mean precision of 0.766 and mean recall of 0.766. This is only 0.021 less than the 2020 competition winning model (cross validated in-house) in terms of the DSC score. On the testing set, the model performs similarly, with DSC of 0.736, precision of 0.773, and recall of 0.760, which is only 0.023 lower in DSC than the 2020 competition winning model. This work shows that cancer segmentation via transformer-based models is a promising research area to further explore.  ( 2 min )
    kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval. (arXiv:2205.06205v1 [cs.IR])
    Candidate generation is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. Since candidate generation is at the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach for candidate generation is to leverage approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates reflective of the user's multiple interests. To this end, we introduce kNN-Embed, a general approach to improving diversity in dense ANN-based retrieval. kNN-Embed represents each user as a smoothed mixture over learned item clusters that represent distinct `interests' of the user. By querying each of a user's mixture components in proportion to its mixture weight, we retrieve a high-diversity set of candidates reflecting elements from each of a user's interests. We experimentally compare kNN-Embed to standard ANN candidate retrieval, and show significant improvements in overall recall and improved diversity across three datasets. Accompanying this work, we open source a large Twitter follow-graph dataset, to spur further research in graph-mining and representation learning for recommender systems.  ( 2 min )
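    The proportional querying idea can be sketched directly; here `ann_index.search` is a hypothetical stand-in for any ANN library call, and the per-component budget rule is an illustrative reading of the abstract.

        def multi_interest_retrieve(mixture, ann_index, total_k=100):
            """Retrieve candidates from each mixture component in proportion
            to its weight (sketch; `ann_index.search(query, k)` is assumed to
            return the k nearest item ids for a query embedding).

            mixture: list of (weight, query_embedding) pairs for one user
            """
            candidates = []
            for weight, query in mixture:
                k = max(1, round(total_k * weight))     # budget per interest
                candidates.extend(ann_index.search(query, k))
            # de-duplicate while keeping order of discovery
            seen, out = set(), []
            for item in candidates:
                if item not in seen:
                    seen.add(item)
                    out.append(item)
            return out[:total_k]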
    A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection. (arXiv:2205.05684v1 [eess.AS])
    Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that an end-to-end model performs at least as well as a considerably larger two-step system that utilizes a hard decision boundary under various noise conditions and number of parallel face tracks.  ( 2 min )
    Image Segmentation with Topological Priors. (arXiv:2205.06197v1 [cs.CV])
    Incorporating topological priors into segmentation has been shown to reduce errors in fine-scale structures. In this work, we use topological priors both before and during the deep neural network training procedure. We compared the results of the two approaches with plain segmentation on various accuracy metrics and on the Betti number error, which is directly related to topological correctness, and found that incorporating topological information into the classical UNet model performed significantly better. We conducted experiments on the ISBI EM segmentation dataset.  ( 2 min )
    Fighting Money Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v3 [stat.ML] UPDATED)
    Money laundering is a profound global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is a lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability, and fairness of the results.  ( 2 min )
    SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients. (arXiv:2106.08208v10 [math.OC] UPDATED)
    Adaptive gradient methods have shown excellent performances for solving many machine learning problems. Although multiple adaptive gradient methods were recently studied, they mainly focus on either empirical or theoretical aspects and also only work for specific problems by using some specific adaptive learning rates. Thus, it is desired to design a universal framework for practical algorithms of adaptive gradients with theoretical guarantee to solve general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate the momentum and variance reduced techniques. In particular, our novel framework provides the convergence analysis support for adaptive gradient methods under the nonconvex setting. In theoretical analysis, we prove that our SUPER-ADAM algorithm can achieve the best known gradient (i.e., stochastic first-order oracle (SFO)) complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms. Code is available at https://github.com/LIJUNYI95/SuperAdam  ( 2 min )
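    For intuition, the generic template that such a universal adaptive matrix generalizes looks like the single step below, where the diagonal matrix H is one particular (Adam-like) choice; SUPER-ADAM's universal adaptive matrix subsumes this and other forms, so treat this as a sketch of the family, not the paper's algorithm.

        import numpy as np

        def adaptive_step(x, grad, m, v, lr=1e-3,
                          beta1=0.9, beta2=0.999, eps=1e-8):
            """One step of a generic adaptive-matrix update (sketch).

            m is the momentum buffer, v parametrizes a diagonal adaptive
            matrix H = diag(sqrt(v) + eps); choosing v differently recovers
            different members of the adaptive-gradient family.
            """
            m = beta1 * m + (1 - beta1) * grad          # momentum
            v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
            H = np.sqrt(v) + eps                        # diagonal adaptive matrix
            x = x - lr * m / H                          # preconditioned step
            return x, m, v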
    One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code. (arXiv:2205.06126v1 [cs.CL])
    People perceive the world with multiple senses (e.g., through hearing sounds, reading words and seeing objects). However, most existing AI systems only process an individual modality. This paper presents an approach that excels at handling multiple modalities of information with a single model. In our "{SkillNet}" model, different parts of the parameters are specialized for processing different modalities. Unlike traditional dense models that always activate all the model parameters, our model sparsely activates parts of the parameters whose skills are relevant to the task. Such a design enables SkillNet to learn skills in a more interpretable way. We develop our model for five modalities including text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. Moreover, our model supports self-supervised pretraining in the same sparsely activated way, resulting in better initialized parameters for different modalities. We find that pretraining significantly improves the performance of SkillNet on five modalities, on par with or even better than baselines with modality-specific pretraining. On the task of Chinese text-to-image retrieval, our final system achieves higher accuracy than existing leading systems including Wukong-ViT-B and Wenlan 2.0, while using fewer activated parameters.  ( 2 min )
    Controlling chaotic itinerancy in laser dynamics for reinforcement learning. (arXiv:2205.05987v1 [physics.optics])
    Photonic artificial intelligence has attracted considerable interest in accelerating machine learning; however, the unique optical properties have not been fully utilized for achieving higher-order functionalities. Chaotic itinerancy, with its spontaneous transient dynamics among multiple quasi-attractors, can be employed to realize brain-like functionalities. In this paper, we propose a method for controlling the chaotic itinerancy in a multi-mode semiconductor laser to solve a machine learning task, known as the multi-armed bandit problem, which is fundamental to reinforcement learning. The proposed method utilizes ultrafast chaotic itinerant motion in mode competition dynamics controlled via optical injection. We found that the exploration mechanism is completely different from a conventional searching algorithm and is highly scalable, outperforming the conventional approaches for large-scale bandit problems. This study paves the way to utilize chaotic itinerancy for effectively solving complex machine learning tasks as photonic hardware accelerators.  ( 2 min )
    Algebraic Machine Learning with an Application to Chemistry. (arXiv:2205.05795v1 [math.AG])
    As data used in scientific application become more complex, studying their geometry and topology has become an increasingly prevalent part of the data analysis process. This can be seen for example with the growing interest in topological tools such as persistent homology. However, on the one hand, topological tools are inherently limited to providing only coarse information about the underlying space of the data. On the other hand, more geometric approaches rely predominately on the manifold hypothesis, which asserts that the underlying space is a smooth manifold. This assumption fails for many physical models where the underlying space contains singularities. In this paper we develop a machine learning pipeline that captures fine-grain geometric information without having to rely on any smoothness assumptions. Our approach involves working within the scope of algebraic geometry and algebraic varieties instead of differential geometry and smooth manifolds. In the setting of the variety hypothesis, the learning problem becomes to find the underlying variety using sample data. We cast this learning problem into a Maximum A Posteriori optimization problem which we solve in terms of an eigenvalue computation. Having found the underlying variety, we explore the use of Gröbner bases and numerical methods to reveal information about its geometry. In particular, we propose a heuristic for numerically detecting points lying near the singular locus of the underlying variety.  ( 2 min )
    Privacy-Preserving Distributed Machine Learning Made Faster. (arXiv:2205.05825v1 [cs.CR])
    With the development of machine learning, it is difficult for a single server to process all the data, so machine learning tasks need to be spread across multiple servers, turning centralized machine learning into distributed machine learning. However, privacy remains an unsolved problem in distributed machine learning. Multi-key homomorphic encryption is one of the suitable candidates to solve the problem. However, the most recent multi-key homomorphic encryption scheme (MKTFHE) only supports the NAND gate. Although this gate is functionally complete, efficient compositions of NAND gates are required to further support mathematical calculation. This paper designs and implements a series of accurate operations on positive and negative integers. First, we design basic bootstrapped gates with the same efficiency as the NAND gate. Second, we construct practical $k$-bit two's-complement arithmetic operators based on our basic binary bootstrapped gates. The constructed operators can perform addition, subtraction, multiplication, and division on both positive and negative integers. Finally, we demonstrate the generality of the designed operators by implementing a distributed privacy-preserving machine learning algorithm, i.e. linear regression, with two different solutions. Experiments show that the operators we designed are practical and efficient.  ( 2 min )
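    The construction of k-bit two's-complement addition from one-bit gates can be sketched on plaintext bits; in MKTFHE every boolean operator below would be a bootstrapped homomorphic gate, so this illustrates only the circuit structure, not the encryption.

        def full_adder(a, b, cin):
            """One-bit full adder from basic boolean gates (here on plaintext
            bits; homomorphically, each gate would be bootstrapped)."""
            s = a ^ b ^ cin
            cout = (a & b) | (cin & (a ^ b))
            return s, cout

        def add_kbit(x, y, k=8):
            """k-bit two's-complement addition as a ripple of full adders,
            so negative integers come for free (overflow wraps)."""
            s, carry = [], 0
            for i in range(k):
                bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
                s.append(bit)
            val = sum(b << i for i, b in enumerate(s))
            return val - (1 << k) if s[-1] else val   # sign-extend

        print(add_kbit(5, -3))   # 2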
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v3 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g., average treatment effects and average derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.  ( 2 min )
    Orthogonal Gromov-Wasserstein Discrepancy with Efficient Lower Bound. (arXiv:2205.05838v1 [cs.LG])
    Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspiration from the connection with the quadratic assignment problem, and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient and closed-form lower bound with the complexity of $\mathcal{O}(n^3)$, and directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets.  ( 2 min )
    GAN-DUF: Hierarchical Deep Generative Models for Design Under Free-Form Geometric Uncertainty. (arXiv:2202.10558v3 [cs.CE] UPDATED)
    Deep generative models have demonstrated effectiveness in learning compact and expressive design representations that significantly improve geometric design optimization. However, these models do not consider the uncertainty introduced by manufacturing or fabrication. Past work that quantifies such uncertainty often makes simplifying assumptions on geometric variations, while the "real-world", "free-form" uncertainty and its impact on design performance are difficult to quantify due to the high dimensionality. To address this issue, we propose a Generative Adversarial Network-based Design under Uncertainty Framework (GAN-DUF), which contains a deep generative model that simultaneously learns a compact representation of nominal (ideal) designs and the conditional distribution of fabricated designs given any nominal design. This opens up new possibilities of 1) building a universal uncertainty quantification model compatible with both shape and topological designs, 2) modeling free-form geometric uncertainties without the need to make any assumptions on the distribution of geometric variability, and 3) allowing fast prediction of uncertainties for new nominal designs. We can combine the proposed deep generative model with robust design optimization or reliability-based design optimization for design under uncertainty. We demonstrated the framework on two real-world engineering design examples and showed its capability of finding the solution that possesses better performances after fabrication.  ( 2 min )
    Leveraging Uncertainty for Deep Interpretable Classification and Weakly-Supervised Segmentation of Histology Images. (arXiv:2205.05841v1 [eess.IV])
    Trained using only image class labels, deep weakly supervised methods allow image classification and ROI segmentation for interpretability. Despite their success on natural images, they face several challenges over histology data, where ROIs are visually similar to the background, making models vulnerable to high pixel-wise false positives. These methods lack mechanisms for explicitly modeling non-discriminative regions, which raises false-positive rates. We propose novel regularization terms, which enable the model to seek both non-discriminative and discriminative regions, while discouraging unbalanced segmentations and using only image class labels. Our method is composed of two networks: a localizer that yields a segmentation mask, followed by a classifier. The training loss pushes the localizer to build a segmentation mask that captures the most discriminative regions while simultaneously modeling background regions. Comprehensive experiments over two histology datasets showed the merits of our method in reducing false positives and accurately segmenting ROIs.  ( 2 min )
    Comments on: "Hybrid Semiparametric Bayesian Networks". (arXiv:2205.05910v1 [stat.ME])
    Invited discussion on the paper "Hybrid Semiparametric Bayesian Networks" by David Atienza, Pedro Larranaga and Concha Bielza (TEST, 2022).  ( 2 min )
    Accounting for the Sequential Nature of States to Learn Features for Reinforcement Learning. (arXiv:2205.06000v1 [cs.LG])
    In this work, we investigate the properties of data that cause popular representation learning approaches to fail. In particular, we find that in environments where states do not significantly overlap, variational autoencoders (VAEs) fail to learn useful features. We demonstrate this failure in a simple gridworld domain, and then provide a solution in the form of metric learning. However, metric learning requires supervision in the form of a distance function, which is absent in reinforcement learning. To overcome this, we leverage the sequential nature of states in a replay buffer to approximate a distance metric and provide a weak supervision signal, under the assumption that temporally close states are also semantically similar. We modify a VAE with triplet loss and demonstrate that this approach is able to learn useful features for downstream tasks, without additional supervision, in environments where standard VAEs fail.  ( 2 min )
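    A minimal sketch of the weak-supervision idea: sample an anchor state from the replay buffer, take its temporal neighbor as the positive and a state several steps away as the negative, and apply a triplet margin loss on the VAE's latent codes. The sampling scheme and the gap of 10 steps are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def temporal_triplet_loss(encode, buffer, margin=1.0, gap=10):
            """Weak metric supervision from a replay buffer (sketch).

            Temporally adjacent states serve as (anchor, positive) pairs and
            states at least `gap` steps apart as negatives; `encode` maps a
            state to its VAE latent mean.
            """
            t = torch.randint(0, len(buffer) - gap - 1, (1,)).item()
            anchor = encode(buffer[t])
            positive = encode(buffer[t + 1])       # temporally close
            negative = encode(buffer[t + gap])     # temporally far
            return F.triplet_margin_loss(anchor.unsqueeze(0),
                                         positive.unsqueeze(0),
                                         negative.unsqueeze(0),
                                         margin=margin)

        # Total objective: the standard VAE ELBO plus this triplet term.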
    Secure Aggregation for Federated Learning in Flower. (arXiv:2205.06117v1 [cs.LG])
    Federated Learning (FL) allows parties to learn a shared prediction model by delegating the training computation to clients and aggregating all the separately trained models on the server. To prevent private information being inferred from local models, Secure Aggregation (SA) protocols are used to ensure that the server is unable to inspect individual trained models as it aggregates them. However, current implementations of SA in FL frameworks have limitations, including vulnerability to client dropouts or configuration difficulties. In this paper, we present Salvia, an implementation of SA for Python users in the Flower FL framework. Based on the SecAgg(+) protocols for a semi-honest threat model, Salvia is robust against client dropouts and exposes a flexible and easy-to-use API that is compatible with various machine learning frameworks. We show that Salvia's experimental performance is consistent with SecAgg(+)'s theoretical computation and communication complexities.  ( 2 min )
    A Generalist Agent. (arXiv:2205.06175v1 [cs.AI])
    Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.  ( 2 min )
    Reducing Overconfidence Predictions for Autonomous Driving Perception. (arXiv:2202.07825v2 [cs.CV] UPDATED)
    In state-of-the-art deep learning for object recognition, SoftMax and Sigmoid functions are most commonly employed as the predictor outputs. Such layers often produce overconfident predictions rather than proper probabilistic scores, which can thus harm the decision-making of `critical' perception systems applied in autonomous driving and robotics. Given this, this work proposes a probabilistic approach based on distributions calculated from the logit-layer scores of pre-trained networks. We demonstrate that Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) functions are more suitable for probabilistic interpretations than SoftMax and Sigmoid-based predictions for object recognition. We explore distinct sensor modalities via RGB images and LiDAR (RV: range-view) data from the KITTI and Lyft Level-5 datasets, where our approach shows promising performance compared to the usual SoftMax and Sigmoid layers, with the benefit of enabling interpretable probabilistic predictions. Another advantage of the approach introduced in this paper is that the ML and MAP functions can be applied to existing trained networks, that is, the approach operates directly on the output of the logit layer of pre-trained networks. Thus, there is no need to carry out a new training phase, since the ML and MAP functions are used in the test/prediction phase.  ( 2 min )
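    One way to realize such post-hoc ML/MAP predictions is sketched below, assuming for illustration that per-class logit scores are modeled as diagonal Gaussians; the paper's exact distributional choice may differ.

        import numpy as np
        from scipy.stats import norm

        def fit_logit_distributions(logits, labels, n_classes):
            """Fit per-class Gaussians to a pre-trained network's logit
            scores (a simple stand-in for the paper's distribution choice)."""
            return [(logits[labels == c].mean(axis=0),
                     logits[labels == c].std(axis=0) + 1e-6)
                    for c in range(n_classes)]

        def ml_map_predict(z, dists, priors):
            """Score a test logit vector z by likelihood under each class."""
            loglik = np.array([norm.logpdf(z, mu, sd).sum()
                               for mu, sd in dists])
            ml = loglik.argmax()                        # maximum likelihood
            map_ = (loglik + np.log(priors)).argmax()   # maximum a-posteriori
            return ml, map_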
    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. (arXiv:2204.04581v2 [cs.CL] UPDATED)
    Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text (Lewis et al., 2021). This approach yields a high coverage knowledge representation that maintains KB-like properties due to its representations being more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.  ( 2 min )
    Adversarial Estimators. (arXiv:2204.10495v2 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.  ( 2 min )
    Dimension-adaptive machine-learning-based quantum state reconstruction. (arXiv:2205.05804v1 [quant-ph])
    We introduce an approach for performing quantum state reconstruction on systems of $n$ qubits using a machine-learning-based reconstruction system trained exclusively on $m$ qubits, where $m\geq n$. This approach removes the necessity of exactly matching the dimensionality of a system under consideration with the dimension of a model used for training. We demonstrate our technique by performing quantum state reconstruction on randomly sampled systems of one, two, and three qubits using machine-learning-based methods trained exclusively on systems containing at least one additional qubit. The reconstruction time required for machine-learning-based methods scales significantly more favorably than the training time; hence this technique can offer an overall savings of resources by leveraging a single neural network for dimension-variable state reconstruction, obviating the need to train dedicated machine-learning systems for each Hilbert space.  ( 2 min )
    eFedDNN: Ensemble based Federated Deep Neural Networks for Trajectory Mode Inference. (arXiv:2205.05756v1 [cs.LG])
    As the most significant data source in smart mobility systems, GPS trajectories can help identify user travel mode. However, these GPS datasets may contain users' private information (e.g., home location), preventing many users from sharing their private information with a third party. Hence, identifying travel modes while protecting users' privacy is a significant issue. To address this challenge, we use federated learning (FL), a privacy-preserving machine learning technique that aims at collaboratively training a robust global model by accessing users' locally trained models but not their raw data. Specifically, we designed a novel ensemble-based Federated Deep Neural Network (eFedDNN). The ensemble method combines the outputs of the different models learned via FL by the users and shows an accuracy that surpasses comparable models reported in the literature. Extensive experimental studies on a real-world open-access dataset from Montreal demonstrate that the proposed inference model can achieve accurate identification of users' mode of travel without compromising privacy.  ( 2 min )
    Cross-domain Few-shot Meta-learning Using Stacking. (arXiv:2205.05831v1 [cs.CV])
    Cross-domain few-shot meta-learning (CDFSML) addresses learning problems where knowledge needs to be transferred from several source domains into an instance-scarce target domain with an explicitly different input distribution. Recently published CDFSML methods generally construct a "universal model" that combines knowledge of multiple source domains into one backbone feature extractor. This enables efficient inference but necessitates re-computation of the backbone whenever a new source domain is added. Moreover, state-of-the-art methods derive their universal model from a collection of backbones -- normally one for each source domain -- and the backbones may be constrained to have the same architecture as the universal model. We propose a CDFSML method that is inspired by the classic stacking approach to meta learning. It imposes no constraints on the backbones' architecture or feature shape and does not incur the computational overhead of (re-)computing a universal model. Given a target-domain task, it fine-tunes each backbone independently, uses cross-validation to extract meta training data from the task's instance-scarce support set, and learns a simple linear meta classifier from this data. We evaluate our stacking approach on the well-known Meta-Dataset benchmark, targeting image classification with convolutional neural networks, and show that it often yields substantially higher accuracy than competing methods.  ( 2 min )
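    The stacking step reduces to a few lines once each backbone's cross-validated predictions on the support set are available; `backbone_probs`, `query_probs`, and the use of logistic regression as the linear meta-classifier are illustrative assumptions.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def stack_backbones(backbone_probs, support_labels, query_probs):
            """Learn a linear meta-classifier on cross-validated backbone
            predictions (stacking), then apply it to the query set.

            backbone_probs: list of (n_support, n_classes) arrays, one per
                            fine-tuned backbone, produced by cross-validation
                            on the task's instance-scarce support set
            query_probs:    matching list of (n_query, n_classes) arrays
            """
            meta_X = np.hstack(backbone_probs)   # concatenate per-backbone scores
            meta = LogisticRegression(max_iter=1000).fit(meta_X, support_labels)
            return meta.predict(np.hstack(query_probs))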
    Open Vocabulary Extreme Classification Using Generative Models. (arXiv:2205.05812v1 [cs.CL])
    The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However in real world scenarios this label set, although large, is often incomplete and experts frequently need to refine it. To develop systems that simplify this process, we introduce the task of open vocabulary XMC (OXMC): given a piece of content, predict a set of labels, some of which may be outside of the known tag set. Hence, in addition to not having training data for some labels - as is the case in zero-shot classification - models need to invent some labels on-the-fly. We propose GROOV, a fine-tuned seq2seq model for OXMC that generates the set of labels as a flat sequence and is trained using a novel loss independent of predicted label order. We show the efficacy of the approach, experimenting with popular XMC datasets for which GROOV is able to predict meaningful labels outside the given vocabulary while performing on par with state-of-the-art solutions for known labels.  ( 2 min )
    Deep-Learned Generators of Porosity Distributions Produced During Metal Additive Manufacturing. (arXiv:2205.05794v1 [cs.LG])
    Laser Powder Bed Fusion has become a widely adopted method for metal Additive Manufacturing (AM) due to its ability to mass produce complex parts with increased local control. However, AM produced parts can be subject to undesirable porosity, negatively influencing the properties of printed components. Thus, controlling porosity is integral for creating effective parts. A precise understanding of the porosity distribution is crucial for accurately simulating potential fatigue and failure zones. Previous research on generating synthetic porous microstructures has succeeded in generating parts with high density, isotropic porosity distributions but is often inapplicable to cases with sparser, boundary-dependent pore distributions. Our work bridges this gap with a method that accounts for these constraints by deconstructing the generation problem into its constitutive parts. A framework is introduced that combines Generative Adversarial Networks with Mallat Scattering Transform-based autocorrelation methods to construct novel realizations of the individual pore geometries and surface roughness, then stochastically reconstruct them to form realizations of a porous printed part. The generated parts are compared to the existing experimental porosity distributions based on statistical and dimensional metrics, such as nearest neighbor distances, pore volumes, pore anisotropies and scattering transform based auto-correlations.  ( 2 min )
    Bridging Model-based Safety and Model-free Reinforcement Learning through System Identification of Low Dimensional Linear Models. (arXiv:2205.05787v1 [cs.RO])
    Bridging model-based safety and model-free reinforcement learning (RL) for dynamic robots is appealing since model-based methods are able to provide formal safety guarantees, while RL-based methods are able to exploit the robot's agility by learning from the full-order system dynamics. However, current approaches to tackle this problem are mostly restricted to simple systems. In this paper, we propose a new method to combine model-based safety with model-free reinforcement learning by explicitly finding a low-dimensional model of the system controlled by an RL policy and applying stability and safety guarantees on that simple model. We use Cassie, a complex bipedal robot that is a high-dimensional nonlinear system with hybrid dynamics and underactuation, and its RL-based walking controller as an example. We show that a low-dimensional dynamical model is sufficient to capture the dynamics of the closed-loop system. We demonstrate that this model is linear, asymptotically stable, and decoupled across control inputs in all dimensions. We further exemplify that such linearity exists even when using different RL control policies. These results point to an interesting direction for understanding the relationship between RL and optimal control: whether RL tends to linearize the nonlinear system during training in some cases. Furthermore, we illustrate that the found linear model is able to provide guarantees via a safety-critical optimal control framework, e.g., Model Predictive Control with Control Barrier Functions, in an example of autonomous navigation using Cassie while taking advantage of the agility provided by the RL-based controller.  ( 2 min )
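    As a minimal illustration of the system-identification step (a sketch under assumed data shapes, not the authors' pipeline), fitting a low-dimensional linear model x_{t+1} = A x_t + B u_t from closed-loop rollouts reduces to least squares, and the claimed asymptotic stability can be checked via the spectral radius of A:

        import numpy as np

        def fit_linear_model(X, U):
            # X: (T+1, n) low-dimensional states from closed-loop rollouts;
            # U: (T, m) control inputs. Solve for [A B] by least squares.
            Z = np.hstack([X[:-1], U])
            theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
            n = X.shape[1]
            A, B = theta[:n].T, theta[n:].T
            # Asymptotic stability of the fit <=> spectral radius of A < 1.
            rho = max(abs(np.linalg.eigvals(A)))
            return A, B, rho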
    Distinction Maximization Loss: Efficiently Improving Classification Accuracy, Uncertainty Estimation, and Out-of-Distribution Detection Simply Replacing the Loss and Calibrating. (arXiv:2205.05874v1 [cs.LG])
    Building robust deterministic deep neural networks is still a challenge. On the one hand, some approaches improve out-of-distribution detection at the cost of reducing classification accuracy in some situations. On the other hand, some methods simultaneously increase classification accuracy, out-of-distribution detection, and uncertainty estimation, but reduce inference efficiency, in addition to training the same model many times to tune hyperparameters. In this paper, we propose training deterministic deep neural networks using our DisMax loss, which works as a drop-in replacement for the commonly used SoftMax loss (i.e., the combination of the linear output layer, the SoftMax activation, and the cross-entropy loss). Starting from the IsoMax+ loss, we created novel logits that are based on the distance to all prototypes rather than just the one associated with the correct class. We also propose a novel way to augment images to construct what we call fractional probability regularization. Moreover, we propose a new score to perform out-of-distribution detection and a fast way to calibrate the network after training. Our experiments show that DisMax usually outperforms all current approaches simultaneously in classification accuracy, uncertainty estimation, inference efficiency, and out-of-distribution detection, avoiding hyperparameter tuning and repetitive model training. The code to replace the SoftMax loss with the DisMax loss and reproduce the results in this paper is available at https://github.com/dlmacedo/distinction-maximization-loss.  ( 2 min )
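    To make the "distance to all prototypes" idea concrete, here is a hedged sketch of distance-based logits in PyTorch (an illustration in the spirit of IsoMax+/DisMax, with assumed initialization and scaling; the authors' exact implementation is in the linked repository):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DistanceLogits(nn.Module):
            # Each logit is a (negative, scaled) distance from the feature
            # to a class prototype, so every prototype shapes every logit.
            def __init__(self, num_features, num_classes):
                super().__init__()
                self.prototypes = nn.Parameter(torch.randn(num_classes, num_features))
                self.scale = nn.Parameter(torch.tensor(10.0))

            def forward(self, features):
                d = torch.cdist(F.normalize(features), F.normalize(self.prototypes))
                return -self.scale.abs() * d  # feed to standard cross-entropy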
    Deep Learning and Synthetic Media. (arXiv:2205.05764v1 [cs.LG])
    Deep learning algorithms are rapidly changing the way in which audiovisual media can be produced. Synthetic audiovisual media generated with deep learning - often subsumed colloquially under the label "deepfakes" - have a number of impressive characteristics; they are increasingly trivial to produce, and can be indistinguishable from real sounds and images recorded with a sensor. Much attention has been dedicated to ethical concerns raised by this technological development. Here, I focus instead on a set of issues related to the notion of synthetic audiovisual media, its place within a broader taxonomy of audiovisual media, and how deep learning techniques differ from more traditional approaches to media synthesis. After reviewing important etiological features of deep learning pipelines for media manipulation and generation, I argue that "deepfakes" and related synthetic media produced with such pipelines do not merely offer incremental improvements over previous methods, but challenge traditional taxonomical distinctions, and pave the way for genuinely novel kinds of audiovisual media.  ( 2 min )
    Representation Learning for Context-Dependent Decision-Making. (arXiv:2205.05820v1 [cs.LG])
    Humans are capable of adjusting to changing environments flexibly and quickly. Empirical evidence has revealed that representation learning plays a crucial role in endowing humans with such a capability. Inspired by this observation, we study representation learning in the sequential decision-making scenario with contextual changes. We propose an online algorithm that is able to learn and transfer context-dependent representations and show that it significantly outperforms the existing ones that do not learn representations adaptively. As a case study, we apply our algorithm to the Wisconsin Card Sorting Task, a well-established test for the mental flexibility of humans in sequential decision-making. By comparing our algorithm with the standard Q-learning and Deep-Q learning algorithms, we demonstrate the benefits of adaptive representation learning.  ( 2 min )
    A Deep Learning Approach for Predicting Two-dimensional Soil Consolidation Using Physics-Informed Neural Networks (PINN). (arXiv:2205.05710v1 [cs.CE])
    Soil consolidation is closely related to seepage, stability, and settlement of geotechnical structures and foundations, and directly affects the use and safety of superstructures. Nowadays, the unidirectional consolidation theory of soils is widely used under certain conditions and for approximate calculations. The multi-directional theory of soil consolidation is more reasonable than the unidirectional theory in practical applications, but it is much more complicated in terms of index determination and solution. To address the above problem, in this paper, we propose a deep learning method using physics-informed neural networks (PINN) to predict the excess pore water pressure of two-dimensional soil consolidation. In the proposed method, (1) a fully connected neural network is constructed, (2) the computational domain, partial differential equation (PDE), and constraints are defined to generate data for model training, and (3) the PDE of two-dimensional soil consolidation and the neural network model are connected to reduce the loss of the model. The effectiveness of the proposed method is verified by comparison with the numerical solution of the PDE for two-dimensional consolidation. Using this method, the excess pore water pressure can be predicted simply and efficiently. In addition, the method was applied to predict the excess pore water pressure of the foundation soil in a real case at Tianjin Port, China. The proposed deep learning approach can be used to investigate large and complex multi-directional soil consolidation problems.  ( 2 min )
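    Step (3) hinges on differentiating the network output with respect to its inputs. A minimal sketch of the PDE-residual term, assuming a Terzaghi-type consolidation equation u_t = c_v (u_xx + u_yy) (the paper's exact PDE and coefficients may differ):

        import torch

        def pde_residual(net, xyt, c_v=1.0):
            # xyt: (N, 3) collocation points (x, y, t).
            xyt = xyt.requires_grad_(True)
            u = net(xyt)
            g = torch.autograd.grad(u.sum(), xyt, create_graph=True)[0]
            u_x, u_y, u_t = g[:, 0], g[:, 1], g[:, 2]
            u_xx = torch.autograd.grad(u_x.sum(), xyt, create_graph=True)[0][:, 0]
            u_yy = torch.autograd.grad(u_y.sum(), xyt, create_graph=True)[0][:, 1]
            # The mean squared residual joins the data/boundary losses.
            return u_t - c_v * (u_xx + u_yy)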
    A Survey of Risk-Aware Multi-Armed Bandits. (arXiv:2205.05843v1 [stat.ML])
    In several applications such as clinical trials and financial portfolio optimization, the expected value (or the average reward) does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest, and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then, we define risk-aware bandit problems and consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests, as well as the best-arm identification setting, which is a pure exploration problem -- both in the context of risk-sensitive measures. We conclude by commenting on persisting challenges and fertile areas for future research.  ( 2 min )
    RITA: a Study on Scaling Up Generative Protein Sequence Models. (arXiv:2205.05789v1 [q-bio.QM])
    In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.  ( 2 min )
    Visualization Guidelines for Model Performance Communication Between Data Scientists and Subject Matter Experts. (arXiv:2205.05749v1 [cs.HC])
    Presenting the complexities of a model's performance is a communication bottleneck that threatens collaborations between data scientists and subject matter experts. Accuracy and error metrics alone fail to tell the whole story of a model - its risks, strengths, and limitations - making it difficult for subject matter experts to feel confident in deciding to use a model. As a result, models may fail in unexpected ways if their weaknesses are not clearly understood. Alternatively, models may go unused, as subject matter experts disregard poorly presented models in favor of familiar, yet arguably substandard methods. In this paper, we propose effective use of visualization as a medium for communication between data scientists and subject matter experts. Our research addresses the gap between common practices in model performance communication and the understanding of subject matter experts and decision makers. We derive a set of communication guidelines and recommended visualizations for communicating model performance based on interviews of both data scientists and subject matter experts at the same organization. We conduct a follow-up study with subject matter experts to evaluate the efficacy of our guidelines in presentations of model performance with and without our recommendations. We find that our proposed guidelines made subject matter experts more aware of the tradeoffs of the presented model. Participants realized that current communication methods left them without a robust understanding of the model's performance, potentially giving them misplaced confidence in the use of the model.  ( 2 min )
    Learning to Guide Multiple Heterogeneous Actors from a Single Human Demonstration via Automatic Curriculum Learning in StarCraft II. (arXiv:2205.05784v1 [cs.LG])
    Traditionally, learning from human demonstrations via direct behavior cloning can lead to high-performance policies given that the algorithm has access to large amounts of high-quality data covering the most likely scenarios to be encountered when the agent is operating. However, in real-world scenarios, expert data is limited and it is desired to train an agent that learns a behavior policy general enough to handle situations that were not demonstrated by the human expert. Another alternative is to learn these policies with no supervision via deep reinforcement learning, however, these algorithms require a large amount of computing time to perform well on complex tasks with high-dimensional state and action spaces, such as those found in StarCraft II. Automatic curriculum learning is a recent mechanism comprised of techniques designed to speed up deep reinforcement learning by adjusting the difficulty of the current task to be solved according to the agent's current capabilities. Designing a proper curriculum, however, can be challenging for sufficiently complex tasks, and thus we leverage human demonstrations as a way to guide agent exploration during training. In this work, we aim to train deep reinforcement learning agents that can command multiple heterogeneous actors where starting positions and overall difficulty of the task are controlled by an automatically-generated curriculum from a single human demonstration. Our results show that an agent trained via automated curriculum learning can outperform state-of-the-art deep reinforcement learning baselines and match the performance of the human expert in a simulated command and control task in StarCraft II modeled over a real military scenario.  ( 2 min )
    Stochastic first-order methods for average-reward Markov decision processes. (arXiv:2205.05800v1 [cs.LG])
    We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.  ( 2 min )
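    For readers new to the average-reward setting, the textbook baseline that VRTD/EVRTD refine is differential semi-gradient TD(0) with linear function approximation (a plain sketch of that baseline, not the paper's variance-reduced estimator):

        import numpy as np

        def differential_td_step(w, rho, phi_s, r, phi_s_next, alpha=0.01, beta=0.01):
            # w: weights of the linear value approximation; rho: running
            # estimate of the average reward. Differential TD error:
            delta = r - rho + phi_s_next @ w - phi_s @ w
            w = w + alpha * delta * phi_s
            rho = rho + beta * delta
            return w, rho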
    Tiny Robot Learning: Challenges and Directions for Machine Learning in Resource-Constrained Robots. (arXiv:2205.05748v1 [cs.LG])
    Machine learning (ML) has become a pervasive tool across computing systems. An emerging application that stress-tests the challenges of ML system design is tiny robot learning, the deployment of ML on resource-constrained low-cost autonomous robots. Tiny robot learning lies at the intersection of embedded systems, robotics, and ML, compounding the challenges of these domains. Tiny robot learning is subject to challenges from size, weight, area, and power (SWAP) constraints; sensor, actuator, and compute hardware limitations; end-to-end system tradeoffs; and a large diversity of possible deployment scenarios. Tiny robot learning requires ML models to be designed with these challenges in mind, providing a crucible that reveals the necessity of holistic ML system design and automated end-to-end design tools for agile development. This paper gives a brief survey of the tiny robot learning space, elaborates on key challenges, and proposes promising opportunities for future work in ML system design.  ( 2 min )
    Individual Fairness Guarantees for Neural Networks. (arXiv:2205.05763v1 [cs.LG])
    We consider the problem of certifying the individual fairness (IF) of feed-forward neural networks (NNs). In particular, we work with the $\epsilon$-$\delta$-IF formulation, which, given a NN and a similarity metric learnt from data, requires that the output difference between any pair of $\epsilon$-similar individuals is bounded by a maximum decision tolerance $\delta \geq 0$. Working with a range of metrics, including the Mahalanobis distance, we propose a method to overapproximate the resulting optimisation problem using piecewise-linear functions to lower and upper bound the NN's non-linearities globally over the input space. We encode this computation as the solution of a Mixed-Integer Linear Programming problem and demonstrate that it can be used to compute IF guarantees on four datasets widely used for fairness benchmarking. We show how this formulation can be used to encourage models' fairness at training time by modifying the NN loss, and empirically confirm our approach yields NNs that are orders of magnitude fairer than state-of-the-art methods.  ( 2 min )
    LSI: A Learned Secondary Index Structure. (arXiv:2205.05769v1 [cs.DB])
    Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce the Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data. LSI works by building a learned index over a permutation vector, which allows binary search to be performed on the unsorted base data using random access. We additionally augment LSI with a fingerprint vector to accelerate equality lookups. We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.  ( 2 min )
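    A toy version of the permutation-vector idea (an illustrative sketch, not the LSI implementation; the linear fit stands in for the learned model, and the fixed error window is a simplification of the bounded-error search the real system would use):

        import numpy as np

        class TinySecondaryIndex:
            # perm[i] = base-table row id of the i-th smallest key, so binary
            # search works on unsorted base data via one level of indirection.
            def __init__(self, keys, err=64):
                self.keys, self.err = np.asarray(keys), err
                self.perm = np.argsort(self.keys)
                sorted_keys = self.keys[self.perm]
                self.a, self.b = np.polyfit(sorted_keys, np.arange(len(keys)), 1)

            def lookup(self, q):
                pos = int(self.a * q + self.b)  # predicted sorted position
                lo = max(0, pos - self.err)
                hi = min(len(self.keys), pos + self.err)
                window = self.keys[self.perm[lo:hi]]
                i = np.searchsorted(window, q)  # binary search in the window
                if i < len(window) and window[i] == q:
                    return self.perm[lo + i]    # row id in unsorted base data
                return None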
    Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks. (arXiv:2205.05718v1 [cs.CL])
    Human language offers a powerful window into our thoughts -- we tell stories, give explanations, and express our beliefs and goals through words. Abundant evidence also suggests that language plays a developmental role in structuring our learning. Here, we ask: how much of human-like thinking can be captured by learning statistical patterns in language alone? We first contribute a new challenge benchmark for comparing humans and distributional large language models (LLMs). Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization to new, out-of-distribution problems expressed in language. We find that humans are far more robust than LLMs on this benchmark. Next, we propose a hybrid Parse-and-Solve model, which augments distributional LLMs with a structured symbolic reasoning module. We find that this model shows more robust adaptation to out-of-distribution planning problems, demonstrating the promise of hybrid AI models for more human-like reasoning.  ( 2 min )
    A time-varying study of Chinese investor sentiment, stock market liquidity and volatility: Based on deep learning BERT model and TVP-VAR model. (arXiv:2205.05719v1 [q-fin.CP])
    Based on the commentary data of the Shenzhen Stock Index bar on the EastMoney website from January 1, 2018 to December 31, 2019, this paper extracts the embedded investor sentiment using a deep learning BERT model and investigates the time-varying linkage between investor sentiment, stock market liquidity and volatility using a TVP-VAR model. The results show that the impact of investor sentiment on stock market liquidity and volatility is stronger than the reverse effect; although the reverse effect is relatively small, it varies more noticeably with the state of the stock market. In all cases, the response is more pronounced in the short term than in the medium to long term, and the impact is asymmetric, with shocks stronger when the market is in a downward spiral.  ( 2 min )
  • Open

    Causal discovery under a confounder blanket. (arXiv:2205.05715v1 [stat.ME])
    Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering a directed acyclic subgraph of variables known to be causally descended from some (possibly large) set of confounding covariates, i.e. a $\textit{confounder blanket}$. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing causally relevant background information. Under a structural assumption that, we argue, must be satisfied in practice if informative answers are to be found, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We derive a sound and complete algorithm for identifying causal relationships under these conditions and implement testing procedures with provable error control for linear and nonlinear systems. We demonstrate our approach on a range of simulation settings.  ( 2 min )
    Probabilistic methods for approximate archetypal analysis. (arXiv:2108.05767v3 [stat.CO] UPDATED)
    Archetypal analysis is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of archetypal analysis in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data, respectively. We prove that provided the data is approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of archetypal analysis. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of archetypal analysis. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.  ( 2 min )
    Orthogonal Gromov-Wasserstein Discrepancy with Efficient Lower Bound. (arXiv:2205.05838v1 [cs.LG])
    Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspiration from the connection with the quadratic assignment problem, and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient and closed-form lower bound with the complexity of $\mathcal{O}(n^3)$, and directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets.  ( 2 min )
    Adversarial Estimators. (arXiv:2204.10495v2 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.  ( 2 min )
    Comments on: "Hybrid Semiparametric Bayesian Networks". (arXiv:2205.05910v1 [stat.ME])
    Invited discussion on the paper "Hybrid Semiparametric Bayesian Networks" by David Atienza, Pedro Larranaga and Concha Bielza (TEST, 2022).  ( 2 min )
    Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements. (arXiv:2205.06129v1 [stat.ML])
    Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames -- especially those of minorities -- are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the na\"ive Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software package wru.  ( 2 min )
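    To see why zero Census counts are fatal to the baseline, here is a toy version of the BISG update that fBISG generalizes (the probability vectors are hypothetical numbers for illustration only):

        import numpy as np

        def bisg_posterior(p_race_given_surname, p_geo_given_race):
            # Bayes' rule: P(race | surname, geo) is proportional to
            # P(race | surname) * P(geo | race), assuming surname and
            # residence are independent given race.
            post = p_race_given_surname * p_geo_given_race
            return post / post.sum()

        p_surname = np.array([0.05, 0.80, 0.10, 0.05])  # from surname file
        p_geo     = np.array([0.20, 0.00, 0.60, 0.20])  # block counts
        # The zero count wipes out group 2 entirely, however strong the
        # surname evidence -- the measurement-error problem fBISG addresses.
        print(bisg_posterior(p_surname, p_geo))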
    Optimal transport weights for causal inference. (arXiv:2109.01991v4 [stat.ME] UPDATED)
    Imbalance in covariate distributions leads to biased estimates of causal effects. Weighting methods attempt to correct this imbalance but rely on specifying models for the treatment assignment mechanism, which is unknown in observational studies. This leaves researchers to choose the proper weighting method and the appropriate covariate functions for these models without knowing the correct combination to achieve distributional balance. In response to these difficulties, we propose a nonparametric generalization of several other weighting schemes found in the literature: Causal Optimal Transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between any source and target population. Our approach is semiparametrically efficient and model-free but can also incorporate moments or any other important functions of covariates that a researcher desires to balance. Moreover, we show how this method can provide nonparametric estimates of the conditional mean outcome function and nonparametric imputations of the missing potential outcomes, and we give rates of convergence for these estimators. We find that Causal Optimal Transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control trial examining the effect of misoprostol versus oxytocin for the treatment of post-partum hemorrhage.  ( 2 min )
    How I failed machine learning in medical imaging -- shortcomings and recommendations. (arXiv:2103.10292v2 [eess.IV] UPDATED)
    Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we review several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of the literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally, we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available on \url{https://github.com/GaelVaroquaux/ml_med_imaging_failures}  ( 2 min )
    Generating Fair Universal Representations using Adversarial Models. (arXiv:1910.00411v7 [cs.LG] UPDATED)
    We present a data-driven framework for learning fair universal representations (FUR) that guarantee statistical fairness for any learning task that may not be known a priori. Our framework leverages recent advances in adversarial learning to allow a data holder to learn representations in which a set of sensitive attributes are decoupled from the rest of the dataset. We formulate this as a constrained minimax game between an encoder and an adversary where the constraint ensures a measure of usefulness (utility) of the representation. The resulting problem is that of censoring, i.e., finding a representation that is least informative about the sensitive attributes given a utility constraint. For appropriately chosen adversarial loss functions, our censoring framework precisely clarifies the optimal adversarial strategy against strong information-theoretic adversaries; it also achieves the fairness measure of demographic parity for the resulting constrained representations. We evaluate the performance of our proposed framework on both synthetic and publicly available datasets. For these datasets, we use two tradeoff measures: censoring vs. representation fidelity and fairness vs. utility for downstream tasks, to amply demonstrate that multiple sensitive features can be effectively censored even as the resulting fair representations ensure accuracy for multiple downstream tasks.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in mixture-of-experts regression models. (arXiv:2009.10622v3 [math.ST] UPDATED)
    Mixture-of-experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.  ( 2 min )
    Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling. (arXiv:2205.06024v1 [stat.ML])
    The Plackett-Luce (PL) model is ubiquitous in learning-to-rank (LTR) because it provides a useful and intuitive probabilistic model for sampling ranked lists. Counterfactual offline evaluation and optimization of ranking metrics are pivotal for using LTR methods in production. When adopting the PL model as a ranking policy, both tasks require the computation of expectations with respect to the model. These are usually approximated via Monte-Carlo (MC) sampling, since the combinatorial scaling in the number of items to be ranked makes their analytical computation intractable. Despite recent advances in improving the computational efficiency of the sampling process via the Gumbel top-k trick, the MC estimates can suffer from high variance. We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model by combining the Gumbel top-k trick with quasi-Monte Carlo (QMC) sampling, a well-established technique for variance reduction. We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.  ( 2 min )
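    The Gumbel top-k trick referenced above is compact enough to sketch: perturb log-scores with Gumbel noise and argsort to draw an exact PL ranking; swapping the uniforms for a low-discrepancy sequence gives a QMC variant (sketched here with SciPy's Sobol generator, an assumption about the paper's exact construction):

        import numpy as np
        from scipy.stats import qmc

        def pl_sample_rankings(log_scores, n_samples, quasi=True, seed=0):
            n = len(log_scores)
            if quasi:
                u = qmc.Sobol(d=n, scramble=True, seed=seed).random(n_samples)
            else:
                u = np.random.default_rng(seed).random((n_samples, n))
            gumbels = -np.log(-np.log(u))  # inverse-CDF of the Gumbel
            # Sorting perturbed scores in decreasing order samples from PL.
            return np.argsort(-(log_scores + gumbels), axis=1)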
    Training Uncertainty-Aware Classifiers with Conformalized Deep Learning. (arXiv:2205.05878v1 [stat.ML])
    Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We address this problem by developing a novel training algorithm that can lead to more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method leads to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.  ( 2 min )
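    For orientation, here is the vanilla split-conformal construction whose prediction sets the proposed training loss aims to shrink (a standard sketch, assuming a reasonably large calibration set):

        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            # Score = 1 - p(true class) on hold-out data; calibrate a
            # threshold, then keep every class above it at test time.
            n = len(cal_labels)
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]
            qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
            return [np.where(1.0 - p <= qhat)[0] for p in test_probs]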
    An MMSE Lower Bound via Poincar\'e Inequality. (arXiv:2205.05848v1 [cs.IT])
    This paper studies the minimum mean squared error (MMSE) of estimating $\mathbf{X} \in \mathbb{R}^d$ from the noisy observation $\mathbf{Y} \in \mathbb{R}^k$, under the assumption that the noise (i.e., $\mathbf{Y}|\mathbf{X}$) is a member of the exponential family. The paper provides a new lower bound on the MMSE. Towards this end, an alternative representation of the MMSE is first presented, which is argued to be useful in deriving closed-form expressions for the MMSE. This new representation is then used together with the Poincar\'e inequality to provide a new lower bound on the MMSE. Unlike, for example, the Cram\'{e}r-Rao bound, the new bound holds for all possible distributions on the input $\mathbf{X}$. Moreover, the lower bound is shown to be tight in the high-noise regime for the Gaussian noise setting under the assumption that $\mathbf{X}$ is sub-Gaussian. Finally, several numerical examples are shown which demonstrate that the bound performs well in all noise regimes.  ( 2 min )
    Sample Complexity Bounds for Robustly Learning Decision Lists against Evasion Attacks. (arXiv:2205.06127v1 [cs.LG])
    A fundamental problem in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks. In this paper we address this issue within the framework of PAC learning, focusing on the class of decision lists. Given that distributional assumptions are essential in the adversarial setting, we work with probability distributions on the input data that satisfy a Lipschitz condition: nearby points have similar probability. Our key results illustrate that the adversary's budget (that is, the number of bits it can perturb on each input) is a fundamental quantity in determining the sample complexity of robust learning. Our first main result is a sample-complexity lower bound: the class of monotone conjunctions (essentially the simplest non-trivial hypothesis class on the Boolean hypercube) and any superclass has sample complexity at least exponential in the adversary's budget. Our second main result is a corresponding upper bound: for every fixed $k$ the class of $k$-decision lists has polynomial sample complexity against a $\log(n)$-bounded adversary. This sheds further light on the question of whether an efficient PAC learning algorithm can always be used as an efficient $\log(n)$-robust learning algorithm under the uniform distribution.  ( 2 min )
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v1 [cs.LG])
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 2 min )
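    The specific head construction described above is simple to write down (a sketch of the stated initialization, not the authors' full training setup):

        import torch
        import torch.nn as nn

        class OffDiagonalHead(nn.Module):
            # Prediction head fixed to I + W, where only the off-diagonal
            # entries of W are trainable, per the empirical finding above.
            def __init__(self, dim):
                super().__init__()
                self.W = nn.Parameter(torch.zeros(dim, dim))
                self.register_buffer("mask", 1.0 - torch.eye(dim))

            def forward(self, z):
                return z + z @ (self.W * self.mask).T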
    Learning more skills through optimistic exploration. (arXiv:2107.14226v6 [cs.LG] UPDATED)
    Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agent to explore and master the environment by encouraging each skill (latent) to reliably reach different states. However, an inherent exploration problem lingers: when a novel state is actually encountered, the discriminator will necessarily not have seen enough training data to produce accurate and confident skill classifications, leading to low intrinsic reward for the agent and effective penalization of the sort of exploration needed to actually maximize the objective. To combat this inherent pessimism towards exploration, we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement. Our objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples, thus providing an intrinsic reward more tailored to the true objective compared to pseudocount-based methods (Burda et al., 2019). We call this exploration bonus discriminator disagreement intrinsic reward, or DISDAIN. We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels). Thus, we encourage researchers to treat pessimism with DISDAIN.  ( 2 min )
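    The DISDAIN bonus itself is nearly a one-liner once the ensemble's skill posteriors are in hand (a sketch of the information-gain form described above; shapes are assumptions):

        import numpy as np

        def disdain_bonus(member_probs, eps=1e-8):
            # member_probs: (E, K) posteriors over K skills from E ensemble
            # members for one state. Bonus = H(mean posterior) - mean member
            # entropy: zero when members agree, large on novel states.
            mean_p = member_probs.mean(axis=0)
            h_mean = -(mean_p * np.log(mean_p + eps)).sum()
            mean_h = -(member_probs * np.log(member_probs + eps)).sum(axis=1).mean()
            return h_mean - mean_h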
    The Implicit Bias of Benign Overfitting. (arXiv:2201.11489v3 [cs.LG] UPDATED)
    The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally not occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted squared hinge loss. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.  ( 2 min )
    A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models. (arXiv:2104.02640v2 [math.ST] UPDATED)
    Mixture of experts (MoE) are a popular class of statistical and machine learning models that have gained attention over the years due to their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to present nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models pose difficult statistical estimation and model selection questions, both from a computational and theoretical perspective. This paper is devoted to the study of the problem of model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds that take the form of weak oracle inequalities, provided that lower bounds for the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.  ( 2 min )
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v3 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g., average treatment effects and average derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.  ( 2 min )
    Robustness and Reliability When Training With Noisy Labels. (arXiv:2110.03321v2 [stat.ML] UPDATED)
    Labelling of data for supervised learning can be costly and time-consuming and the risk of incorporating label noise in large data sets is imminent. When training a flexible discriminative model using a strictly proper loss, such noise will inevitably shift the solution towards the conditional distribution over noisy labels. Nevertheless, while deep neural networks have proven capable of fitting random labels, regularisation and the use of robust loss functions empirically mitigate the effects of label noise. However, such observations concern robustness in accuracy, which is insufficient if reliable uncertainty quantification is critical. We demonstrate this by analysing the properties of the conditional distribution over noisy labels for an input-dependent noise model. In addition, we evaluate the set of robust loss functions characterised by noise-insensitive, asymptotic risk minimisers. We find that strictly proper and robust loss functions both offer asymptotic robustness in accuracy, but neither guarantee that the final model is calibrated. Moreover, even with robust loss functions, overfitting is an issue in practice. With these results, we aim to explain observed robustness of common training practices, such as early stopping, to label noise. In addition, we aim to encourage the development of new noise-robust algorithms that not only preserve accuracy but that also ensure reliability.  ( 2 min )
    Kernel Two-Sample Tests in High Dimension: Interplay Between Moment Discrepancy and Dimension-and-Sample Orders. (arXiv:2201.00073v2 [math.ST] UPDATED)
    Motivated by the increasing use of kernel-based metrics for high-dimensional and large-scale data, we study the asymptotic behavior of kernel two-sample tests when the dimension and sample sizes both diverge to infinity. We focus on the maximum mean discrepancy (MMD) using isotropic kernel, including MMD with the Gaussian kernel and the Laplace kernel, and the energy distance as special cases. We derive asymptotic expansions of the kernel two-sample statistics, based on which we establish the central limit theorem (CLT) under both the null hypothesis and the local and fixed alternatives. The new non-null CLT results allow us to perform asymptotic exact power analysis, which reveals a delicate interplay between the moment discrepancy that can be detected by the kernel two-sample tests and the dimension-and-sample orders. The asymptotic theory is further corroborated through numerical studies.  ( 2 min )
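    For reference, here is the unbiased Gaussian-kernel MMD^2 statistic whose high-dimensional behavior the paper analyzes (a standard estimator, sketched with an assumed bandwidth):

        import numpy as np
        from scipy.spatial.distance import cdist

        def mmd2_unbiased(X, Y, bandwidth=1.0):
            # Isotropic Gaussian kernel; diagonal terms are excluded from
            # the within-sample sums to keep the estimate unbiased.
            k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2 * bandwidth ** 2))
            Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
            n, m = len(X), len(Y)
            return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
                    + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
                    - 2 * Kxy.mean())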
    Fighting Money Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v3 [stat.ML] UPDATED)
    Money laundering is a profound global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is a lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability, and fairness of the results.  ( 2 min )

  • Open

    [R] Any reference related to regulating the variation of entropies?
    I need some reference papers related to my problem. I have estimates in the form of N normal distributions, but their variances tend to 0. This is because the distributions are aggregated into one normal whose variance tends to 0. So, I want some estimates to be sharp and others to be wide, based on their confidence. One way I thought of doing that is to constrain the variance of their entropies... Has anyone seen a related problem? Thanks! submitted by /u/MNhi_  ( 1 min )
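    One way to write down the poster's idea (a speculative sketch, not a known method; the target variance is a free hyperparameter): the differential entropy of N(mu, sigma^2) is H = log(sigma) + 0.5*log(2*pi*e), so a penalty on the spread of entropies across the N components can keep them from all collapsing to equally sharp Gaussians.

        import math
        import torch

        def entropy_variance_penalty(log_sigma, target_var=1.0):
            # log_sigma: (N,) log standard deviations of the N Gaussians.
            H = log_sigma + 0.5 * math.log(2 * math.pi * math.e)
            # Pull the across-component variance of entropies toward a
            # target so some components stay sharp while others stay wide.
            return (H.var() - target_var) ** 2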
    [D] Does anyone actually use TFX (coming off GoogleIO)
    The video in question: An introduction to MLOps with TensorFlow Extended (TFX). From reading around Reddit it seems like most people don't actually use it, but I thought I'd ask around again. It seems like a great tool, but also like a lot of work to set up, and when I last looked at it (2019?) it was still messy. Does anyone here use it? If so, can you share your experience with it and more context about you and your company? submitted by /u/iamquah  ( 1 min )
    [R] Deepmind's Gato: a generalist learning agent
    Hot on the heels of Flamingo, DeepMind has released a report describing a generalist learning agent that works across disparate tasks. Very cool stuff; I was hoping to get some discussion on it. Here is their abstract: Inspired by progress in large-scale language modelling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato. https://www.deepmind.com/publications/a-generalist-agent There are Hacker News comments here: https://news.ycombinator.com/item?id=31355657 submitted by /u/blabboy  ( 2 min )
    [Discussion] Has anyone deployed FAIR's OPT-175B to Azure? Or runs a demo service?
    Depending on pricing, I am looking to stand up a server on Azure that runs OPT-175B for an external web application for personal usage. So far I have not seen any Docker-equivalent containers or other blogs, walkthroughs, etc., that have done this. Is anyone familiar with this? submitted by /u/thisdavidinspace  ( 1 min )
    [N] Handling very large models on small setups with Hugging Face's Accelerate library
    Last week Meta AI (FAIR) publicly released huge LMs, with up to ☄️30B parameters. Great win for Open-Source 🎉 These checkpoints are now in 🤗 transformers! But how do you use such big checkpoints without humongous machines? We're introducing Accelerate's big model inference: from our tests, you can load and use up to the 30B model in Colab (in fp16): 11B in Colab's free version, 30B in Colab's pro version. This takes advantage of system RAM, GPU RAM, and disk, splitting parameters across devices. While running on Colab takes time (a significant amount of time for the largest models, which sit mostly on disk), running with a fast storage device such as an NVMe disk is much, much faster. Here's a colab notebook if you want to try it out. submitted by /u/jikkii  ( 1 min )
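    The core pattern from the post boils down to a few lines (a sketch of the documented transformers/Accelerate API; the model choice, dtype, and offload folder here are illustrative):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # device_map="auto" lets Accelerate split the checkpoint across GPU
        # RAM, CPU RAM, and disk; weights that fit nowhere go to the folder.
        model = AutoModelForCausalLM.from_pretrained(
            "facebook/opt-30b",
            device_map="auto",
            torch_dtype=torch.float16,
            offload_folder="offload",
        )
        tok = AutoTokenizer.from_pretrained("facebook/opt-30b")
        inputs = tok("Hello, my name is", return_tensors="pt")
        print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))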
    [D] Introduction to Diffusion Models
Diffusion models have gained some impressive ground in the past couple of years, including famously overtaking GANs on image synthesis and being used in DALL-E 2. I wrote this introduction to diffusion models for anyone who is interested in learning more! I get into the mathematical details and lay everything out in (what I hope is) a simple way. If anyone has any questions or comments, I'd be happy to discuss! submitted by /u/SleekEagle
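Not from the linked article itself, but as a quick taste of the math such introductions cover: the DDPM forward (noising) process, which lets you jump straight to any timestep t in closed form:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule (DDPM paper)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(1, 3, 32, 32)               # stand-in for a training image
x_half = q_sample(x0, t=T // 2)              # heavily, but not fully, noised
```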
    [N] Upcoming talk by Jacob Devlin, creator of the BERT algorithm, on the future of NLP
My group is hosting a talk by Google research scientist and NLP guru Jacob Devlin on May 19, 6:30pm ET. Natural language processing (NLP) has seen a revolutionary shift in the last 10 years. What was once an academic curiosity has transformed into a commercially viable tool that is routinely used in finance, healthcare, education, and many other fields. Join us as research scientist and NLP pioneer Jacob Devlin discusses the current state of the industry and the future trends driving the field. We'll conclude with a Q&A opportunity. Whether you're a seasoned NLP researcher or just getting started, this event is for you! About the speaker: Jacob Devlin is a Staff Research Scientist at Google, where his primary research interest is developing fast, powerful, and scalable deep learning models for information retrieval, question answering, and other language understanding tasks. From 2014 to 2017, he worked as a Principal Research Scientist at Microsoft Research. Mr. Devlin received his Master's in Computer Science from the University of Maryland in 2009. This event will be virtual; all are welcome to attend. Agenda: 6:30 introductions and networking; 6:45 presentation; 7:15 discussion; 7:30 wrap-up. submitted by /u/what_comes_next
[D] Can We Query a Table with T5?
In this tutorial, we do transfer learning with Google's text-to-text generation model T5 on our custom data so that it can convert basic questions into SQL queries. We add a new task to T5 called: translate English to SQL. https://www.kdnuggets.com/2022/05/query-table-t5.html submitted by /u/mwitiderrick
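A hedged sketch of the inference side of such a setup (the checkpoint and prompt are illustrative; a base T5 must first be fine-tuned on (question, SQL) pairs before it emits sensible SQL):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task prefix tells T5 which of its (learned) tasks to perform
prompt = "translate English to SQL: how many users signed up in 2021?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```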
    Is this true? Not my comment [D]
    There's no "intelligence" in AI, no "learning" happens in ML, and there's nothing "Neural" about NeuralNets. These are all buzzwords hyped up by researchers to get funding and by VCs to make even more money. Brace for another AI winter submitted by /u/MatterEnough9656 [link] [comments]  ( 3 min )
    Hyperbolic Embeddings: Embeddings for Hierarchical Data [R] [P]
We know how to create and make use of embeddings, but an underutilised type is hyperbolic embeddings. They achieve much better performance on datasets that have a hierarchical structure, such as words, social networks, the Internet, knowledge graphs, genomic data, images, financial data, and much more. The reason is that hyperbolic spaces have constant negative curvature; this induces very tight connections with tree-like graphs, making them ideal candidates for embedding hierarchical data [1]. In the HyperLib library I helped implement the functionality for a type of hyperbolic embedding called Sarkar embeddings. With these embeddings, we first create a tree representation of the data using an algorithm called TreeRep. This algorithm takes the distances between data points and puts them into a hyperbolic tree, where each node is a data point, while trying to maintain the original distances. We then create embeddings from this tree using Sarkar embeddings, which can embed a tree with arbitrarily low distortion. We also wrote a blog post where you can read more about it and find out how to use it: https://medium.com/@nathan_jf/treerep-and-hyperbolic-embeddings-41312c98b264 The library is here: https://github.com/nalexai/Hyperlib Any feedback on usability would be great. [1] Chami, Representation Learning and Algorithms in Hyperbolic Spaces submitted by /u/platinumposter
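Not HyperLib's API, but a generic illustration of why hyperbolic space suits trees: geodesic distance in the Poincaré ball grows rapidly toward the boundary, mirroring the exponential growth of tree levels:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq_dist / (denom + eps))

root = np.array([0.0, 0.0])     # roots sit near the origin
child = np.array([0.3, 0.0])
leaf = np.array([0.85, 0.1])    # leaves get pushed toward the boundary

print(poincare_distance(root, child))   # modest distance
print(poincare_distance(root, leaf))    # much larger, despite the small Euclidean gap
```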
    [D] Do CNN representations really correlate with the pixel space?
A lot of research is predicated on the idea that CNNs keep the spatial layout of the input pixels in the final encoded CNN output matrix C (e.g., 7x7x2048) before the GAP operation and linear classifier (i.e., the last CNN layer). So if they, say, upsample a "box" area from C (e.g., the bottom-right 1x1x2048), it will correspond to the bottom right of the input pixels (e.g., see this and this). I realise the original class activation mapping (CAM) paper showed it tended to locate objects in a picture well, but surely this is a dangerous assumption? If lots of max pooling is used throughout the layers, these spatial relations are definitely called into question. Moreover, the 1x1x2048 part I described above would have been calculated using many "areas" outside that box; if you go back to layer 1, I'm sure you could mathematically show that its final representation in C was calculated using nearly all pixels in the image, not just the ones at the bottom right of the input image. I was just thinking about this today, and I'm open to being told I'm wrong, but it seems no CAM-like method is flawless at localising objects in an image (e.g., see this). So is this whole line of research just doomed? Surely you cannot rely on this in a real high-stakes application? Thanks if you have time to leave your opinion. submitted by /u/SkeeringReal
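The receptive-field claim is easy to probe empirically: backpropagate from one cell of C and see which input pixels receive nonzero gradient. A hedged sketch with a randomly initialised torchvision ResNet-50:

```python
import torch
import torchvision

# Convolutional trunk of ResNet-50: everything before GAP + linear classifier
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50().children())[:-2])
backbone.eval()

x = torch.randn(1, 3, 224, 224, requires_grad=True)
feat = backbone(x)                    # shape: (1, 2048, 7, 7)
feat[0, :, 6, 6].sum().backward()     # the "bottom-right" 1x1x2048 cell of C

influence = x.grad[0].abs().sum(dim=0)          # per-pixel influence on that cell
rows, cols = torch.nonzero(influence, as_tuple=True)
print(f"influenced rows {rows.min().item()}-{rows.max().item()}, "
      f"cols {cols.min().item()}-{cols.max().item()}")   # far more than a corner
```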
    [D] Library to transfer PyTorch to TF
Has anyone of you ever successfully converted a PyTorch model to a TensorFlow model? If so, what can you recommend? There seem to be a few ways, but none work very well, and each has its caveats. The lowest-hanging fruit, pytorch_to_keras, does not work with current versions of TF. ONNX seems like an option, since exporting from PyTorch to ONNX is easy and maintained by PyTorch. But looking closely, it's not that simple: onnx_to_keras suffers from the same issues as pytorch_to_keras. Then there is also the "recommended" way of using ONNX's export_graph. This method produces a tf.SavedModel to load, and trainable variables are lost in the process, which is fine for deploying but not for fine-tuning. So does anyone have recommendations on how to transfer models between PyTorch and TF? submitted by /u/slater_kelevra
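For reference, the export_graph route the poster mentions looks roughly like this (a minimal sketch with onnx-tf; as noted above, the resulting SavedModel is inference-only):

```python
import torch
import onnx
from onnx_tf.backend import prepare   # pip install onnx onnx-tf

# A toy PyTorch model stands in for the real network
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
dummy_input = torch.randn(1, 8)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["x"], output_names=["y"])

tf_rep = prepare(onnx.load("model.onnx"))   # wrap the ONNX graph for TensorFlow
tf_rep.export_graph("model_tf")             # writes a tf.SavedModel directory
```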
[P] Swin Transformer V2 code and models released
https://github.com/microsoft/Swin-Transformer The ImageNet-22K pretrained Swin-V1-Tiny and Swin-V1-Small models are also released. submitted by /u/ancientmooner
    [D] [P] Finding correlation or clusters in a data-warehouse containing categorical data.
Hi, I have a lot of structured data whose points should be, and mostly are, random (i.e., we don't want clusters of points in space; they are and should be sparse). The data is basically a dataset that had leak issues. The problem statement: find whether any subset of the data shares the same root cause (i.e., the same brand or same country caused the issue). There's a mixture of numerical and categorical data (e.g., size, lat, long; country, brand, priority, city). Assume there are 20+ dimensions (columns). My first approach was to check whether those points form any clusters. To do so, I was going to use a density-based clustering algorithm, since you don't have to specify the number of clusters and it will just ignore noise (which should be most of the points). But it's hard to preserve the meaning of the columns with dimensionality reduction, and clustering directly in 20 dimensions seems impractical. We don't know what could cause the leaks, and we don't know which columns are important. What's an alternative approach or a better solution? Thanks submitted by /u/micdean19
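One common route for density-based clustering over mixed numeric/categorical columns is a precomputed Gower distance matrix fed to DBSCAN; a minimal sketch (the gower package is a third-party dependency, and the toy columns are invented):

```python
import pandas as pd
import gower                          # pip install gower; mixed-type distances
from sklearn.cluster import DBSCAN

df = pd.DataFrame({
    "size":    [1.2, 1.3, 4.8, 5.0],
    "country": ["DE", "DE", "US", "US"],
    "brand":   ["a", "a", "c", "c"],
})

dist = gower.gower_matrix(df)         # pairwise distances scaled to [0, 1]
labels = DBSCAN(eps=0.25, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)                         # -1 marks points DBSCAN treats as noise
```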
    [D] What are the heuristics for setting good baselines when experimenting?
Hello everyone. Master's student here, looking for advice about experimentation. Let's say you want to show that some architectural change or tweak to a training technique can improve model performance relative to a "vanilla" baseline. For example, this paper introduces a normalization technique and measures its effectiveness vs. batchnorm on ImageNet and COCO. Maybe you're trying to find such an improvement somewhere else. If you aren't trying to push the state of the art, but rather trying to show the validity of an idea that comes from theory, or just show that X can work better than Y, you probably aren't going to use layers and layers of training techniques to get the best baseline possible: student-teacher networks, stochastic weight averaging, the endless possible training tricks. But the space of possible things to do to set your baseline is huge. For example, if you have your VGG16 model and CIFAR-10, do you train it with a cyclic learning rate schedule? A flat learning rate schedule? How do you compare your augmentation pipeline to other papers'? How much accuracy is enough for you to be content with your baseline? Many papers that report promising results on some architecture design change don't release their training code; how did they decide how to set up their training loops? As far as I can tell, there's no guidebook written down that says "here are the learning rates, augmentations, and training techniques that give a reasonable baseline on ImageNet, if you have a VGG19 model, a ResNet model, etc." The possibilities are paralyzing. So how do you find a good baseline with so many possible training configurations? submitted by /u/Rawr0s
  • Open

    Humans vs. DALL·E — Where do human artists fit in a world of rich, creative AI?
submitted by /u/ML_Firefighter
    State of the art: Text generation
Given an input (say, "How can we solve world hunger?"), what is the current state of the art when it comes to AI outputting an answer (or multiple answers)? Added to that, what could an enthusiast access? submitted by /u/mister_patience
    Most important AI news from Google I/O 2022: neural rendering for Maps, Deepmind for YouTube, CTRL+F for real life, LaMDA2, new AI test app
submitted by /u/Zirius_Sadfaces
    Introducing Gato - a generalist agent from DeepMind
submitted by /u/Yasuuuya
New major release of nebullvm, an open-source library to speed up AI inference by leveraging state-of-the-art optimization techniques (deep learning compilers, and now also quantization and half precision, and soon also sparsity, distillation, etc.)
nebullvm is an open-source library that generates an optimized version of your deep learning model, one that runs 2-10x faster in inference without performance loss, by leveraging multiple deep learning compilers (OpenVINO, TensorRT, etc.). Thanks to today's new release, nebullvm can accelerate models by up to 30x if you specify that you are willing to trade off a self-defined amount of accuracy/precision for an even lower response time and a lighter model. This additional acceleration is achieved by exploiting optimization techniques that slightly modify the model graph to make it lighter, such as quantization, half precision, distillation, sparsity, etc. The goal of nebullvm is to help other developers benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing, and debugging these powerful technologies. Hoping you enjoy the project; please give feedback if you have any. You can also find more information (benchmarks, tutorials, notebooks) on GitHub! Happy acceleration :) https://github.com/nebuly-ai/nebullvm submitted by /u/emilec___
    The superior power of artificial intelligence hiveminds
So Tesla is making a robot AI. At first it'll just be a tool, and this tool will be limited. But I was thinking that, similarly to I, Robot, these devices will have a hive mind (like Tesla cars) and will share experience and learning. This means that if there are 10k robots in the wild, they could hypothetically teach each other 10k new skills every half hour or so. For example, if one is cooking and burns a sandwich, the rest will never burn a sandwich under similar scenarios. We can't fathom what a hive mind of AI is really like. Do any of you see pros and cons to this? Currently we train AI via millions of iterations as fast as possible and go from there. But imagine a million AIs constantly learning and sharing. It's a mystical concept only years away. Thoughts? P.S. I came here because I couldn't find anything on Google about this topic; if you have a paper or talk to share, please do. submitted by /u/NigraOvis
  • Open

    Image classification and object detection using Amazon Rekognition Custom Labels and Amazon SageMaker JumpStart
In the last decade, computer vision use cases have been a growing trend, especially in industries like insurance, automotive, ecommerce, energy, retail, manufacturing, and others. Customers are building computer vision machine learning (ML) models to bring operational efficiencies and automation to their processes. Such models help automate the classification of images or detection of objects […]
    Intelligently search your Jira projects with Amazon Kendra Jira cloud connector
Organizations use agile project management platforms such as Atlassian Jira to enable teams to collaborate to plan, track, and ship deliverables. Jira captures organizational knowledge about the workings of the deliverables in the issues and comments logged during project implementation. However, making this knowledge easily and securely available to users is challenging due to it […]
The Intel® 3D Athlete Tracking (3DAT) scalable architecture deploys pose estimation models using Amazon Kinesis Data Streams and Amazon EKS
This blog post is co-written by Jonathan Lee, Nelson Leung, Paul Min, and Troy Squillaci from Intel. In Part 1 of this post, we discussed how Intel® 3DAT collaborated with AWS Machine Learning Professional Services (MLPS) to build a scalable AI SaaS application. 3DAT uses computer vision and AI to recognize, track, and analyze over 1,000 […]
    Moderate, classify, and process documents using Amazon Rekognition and Amazon Textract
Many companies are overwhelmed by the abundant volume of documents they have to process, organize, and classify to serve their customers better. Examples of such can be loan applications, tax filing, and billing. Such documents are more commonly received in image formats and are mostly multi-paged and in low-quality format. To be more competitive and […]
  • Open

    Common Sense Machine Learning
There are different ways to define common sense machine learning. It could mean using simple models whenever possible, avoiding overfitting, correctly selecting features, or doing cross-validation the right way. Or it could mean not using any data set, yet making predictions that far outperform those from smart teams of data scientists working on large data sets…
  • Open

    Gato the Generalist Agent
What are some of your thoughts on the paper (https://dpmd.ai/Gato-paper) by DeepMind that uses a single network to play Atari, caption images, chat, and stack blocks with a real robot arm? submitted by /u/blitzkreig3
Inverse RL for a non-optimal agent
Inverse reinforcement learning usually assumes that the agent being learned from is behaving optimally. Has there been work done on using inverse reinforcement learning to learn how an agent learns? submitted by /u/sdrinz
    The Fast Deep RL Course: Learn to Build Powerful Deep RL Agents in Just 4 Hours
I am happy to announce The Fast Deep RL Course. This course is made for data scientists and ML engineers who are excited about deep RL and are looking for a short, practical introduction to the topic. It covers everything you need to get started with practical applications. The course takes advantage of the powerful capabilities of Ray-RLlib, a production-grade deep RL framework. Ray-RLlib's high-level interface can be learned quickly and will allow you to apply deep RL to various problems after you finish the course. It also provides a simple path for further learning: since Ray-RLlib is likely to remain the de facto deep RL framework in the industry, you don't need to change your tools when you want to learn more. You simply study the lower-level interfaces of the same tool. The course is modeled after the engaging style of Datacamp and Codecademy, consisting of short videos followed by coding exercises where you can try out what you learned. I have also spent some time keeping it accessible: we use small neural nets that can be trained on a CPU, so a decent laptop is all you need. I have tested the code on various Linux distros, Mac (including M1), and Windows, which means you can simply use your regular OS. All videos come with high-quality captions. The course is free for early supporters (defined as anyone enrolling within the next month). Thanks for trying it out. I will be happy to discuss and answer any questions. submitted by /u/rroocckk
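Not taken from the course itself, but a generic sketch of RLlib's high-level interface circa Ray 1.x, the kind of entry point such a course builds on:

```python
from ray import tune

# Train PPO on CartPole through RLlib's high-level Tune interface
tune.run(
    "PPO",
    stop={"episode_reward_mean": 150},   # stop once the policy is decent
    config={
        "env": "CartPole-v1",
        "framework": "torch",
        "num_workers": 0,                # learn in the driver process only
    },
)
```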
    "Learning Accurate Long-term Dynamics for Model-based Reinforcement Learning", Lambert et al 2020
submitted by /u/gwern
    Ray RLlib runs more workers than given.
I want to start Ray Tune with just the local worker/learner and give it 12 CPUs and one GPU. How do I achieve that? I have in my config: num_gpus: 1, num_cpus_for_driver: 12, num_workers: 0. However, it starts 12 workers with 1 CPU and 0 GPUs each. submitted by /u/Defiant_Sun5579
using stable baselines, why do you have to use a lambda function while wrapping an environment
I have been searching for an hour and can't find it. It seems so silly to me. I have accepted it, but would really like to know the logic behind it. Example: DummyVecEnv([lambda: env]) works, while DummyVecEnv([env]) does not. submitted by /u/Jobdriaan
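The short answer is that DummyVecEnv expects environment constructors, not instances: it calls each element of the list to build the environments it manages. A minimal sketch (assuming stable-baselines3 and gym):

```python
import gym
from stable_baselines3.common.vec_env import DummyVecEnv

# Each list element must be a zero-argument callable returning an env;
# wrapping in a lambda is what makes an already-built env (or gym.make) callable.
vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])

obs = vec_env.reset()
print(obs.shape)   # (4, 4): one observation per managed environment
```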
  • Open

    Challenges in Multi-objective Optimization for Automatic Wireless Network Planning
Posted by Sara Ahmadian and Matthew Fahrbach, Research Scientists, Google Research, Large-Scale Optimization Team Economics, combinatorics, physics, and signal processing conspire to make it difficult to design, build, and operate high-quality, cost-effective wireless networks. The radio transceivers that communicate with our mobile phones, the equipment that supports them (such as power and wired networking), and the physical space they occupy are all expensive, so it’s important to be judicious in choosing sites for new transceivers. Even when the set of available sites is limited, there are exponentially many possible networks that can be built. For example, given only 50 sites, there are 2^50 (over a million billion) possibilities! Further complicating things, for every location wher…
  • Open

    Urban Jungle: AI-Generated Endangered Species Mix With Times Square’s Nightlife
Bengal tigers, red pandas and mountain gorillas are among the world’s most familiar endangered species, but tens of thousands of others — like the Karpathos frog, the Perote deer mouse or the Mekong giant catfish — are largely unknown. Typically perceived as lacking star quality, these species are now roaming massive billboards in one of…
    GFN Thursday Gets Groovy As ‘Evil Dead: The Game’ Marks 1,300 Games on GeForce NOW
Good. Bad. You’re the Guy With the Gun this GFN Thursday. Get ready for some horrifyingly good fun with Evil Dead: The Game streaming on GeForce NOW tomorrow at release. It’s the 1,300th game to join GeForce NOW, joining on Friday the 13th. And it’s part of eight total games joining the GeForce NOW library…
  • Open

    Tool recursion
“Literature about Lisp rarely resists that narcissistic pleasure of describing Lisp in Lisp.” — Christian Queinnec, Lisp in Small Pieces   Applying software development tools to themselves has a dark side and a light side. There’s a danger of becoming obsessed with one’s tools and never getting around to using them. If it’s your job […]
  • Open

    The Baltimore Orioles Effect
Back when the text-generating neural network GPT-2 was released, OpenAI released it in stages, in part for fear that people might use the more advanced models to generate misinformation. Now in 2022 we do indeed have people passing off AI-written text as human, but rather than being divisive, it’…
    Bonus: Questionable blog facts
AI Weirdness: the strange side of machine learning
  • Open

Why Machine Learning Projects Fail - 7 Reasons That Can Take Your Efforts for a Ride
Your new machine learning project is about to fail. Yes, you read that right.
  • Open

    Technique protects privacy when making online recommendations
Researchers devise an efficient protocol to keep a user’s private information secure when algorithms use it to recommend products, songs, or shows.
  • Open

    How Do Vision Transformers Work?. (arXiv:2202.06709v3 [cs.CV] UPDATED)
The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at https://github.com/xxxnell/how-do-vits-work.
    Causal Inference Struggles with Agency on Online Platforms. (arXiv:2107.08995v2 [cs.LG] UPDATED)
Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a "Catch-22" that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.
    How Platform-User Power Relations Shape Algorithmic Accountability: A Case Study of Instant Loan Platforms and Financially Stressed Users in India. (arXiv:2205.05661v1 [cs.HC])
Accountability, a requisite for responsible AI, can be facilitated through transparency mechanisms such as audits and explainability. However, prior work suggests that the success of these mechanisms may be limited to Global North contexts; understanding the limitations of current interventions in varied socio-political conditions is crucial to help policymakers facilitate wider accountability. To do so, we examined the mediation of accountability in the existing interactions between vulnerable users and a 'high-risk' AI system in a Global South setting. We report on a qualitative study with 29 financially-stressed users of instant loan platforms in India. We found that users experienced intense feelings of indebtedness for the 'boon' of instant loans, and perceived huge obligations towards loan platforms. Users fulfilled obligations by accepting harsh terms and conditions, over-sharing sensitive data, and paying high fees to unknown and unverified lenders. Users demonstrated a dependence on loan platforms by persisting with such behaviors despite risks of harms such as abuse, recurring debts, discrimination, privacy harms, and self-harm to them. Instead of being enraged with loan platforms, users assumed responsibility for their negative experiences, thus releasing the high-powered loan platforms from accountability obligations. We argue that accountability is shaped by platform-user power relations, and urge caution to policymakers in adopting a purely technical approach to fostering algorithmic accountability. Instead, we call for situated interventions that enhance agency of users, enable meaningful transparency, reconfigure designer-user relations, and prompt a critical reflection in practitioners towards wider accountability. We conclude with implications for responsibly deploying AI in FinTech applications in India and beyond.
    CMOS Circuits for Shape-Based Analog Machine Learning. (arXiv:2202.05022v1 [cs.ET] CROSS LISTED)
While analog computing is attractive for implementing machine learning (ML) processors, the paradigm requires chip-in-the-loop training for every processor to alleviate artifacts due to device mismatch and device non-linearity. Speeding up chip-in-the-loop training requires re-biasing the circuits in a manner that the analog functions remain invariant across training and inference. In this paper, we present an analog computational paradigm and circuits using "shape" functions that remain invariant to transistor biasing (weak, moderate, and strong inversion) and ambient temperature variation. We show that a core Shape-based Analog Compute (S-AC) circuit could be re-biased and reused to implement: (a) non-linear functions; (b) inner-product building blocks; and (c) a mixed-signal logarithmic memory, all of which are integral towards designing an ML inference processor. Measured results using a prototype fabricated in a 180nm standard CMOS process demonstrate bias invariance and hence the resulting analog designs can be scaled for power and speed like digital logic circuits. We also demonstrate a regression task using these CMOS building blocks.
    Incident duration prediction using a bi-level machine learning framework with outlier removal and intra-extra joint optimisation. (arXiv:2205.05197v1 [cs.LG])
Predicting the duration of traffic incidents is a challenging task due to the stochastic nature of events. The ability to accurately predict how long accidents will last can provide significant benefits to both end-users in their route choice and traffic operation managers in handling of non-recurrent traffic congestion. This paper presents a novel bi-level machine learning framework enhanced with outlier removal and intra-extra joint optimisation for predicting the incident duration on three heterogeneous data sets collected for both arterial roads and motorways from Sydney, Australia and San-Francisco, U.S.A. Firstly, we use incident data logs to develop a binary classification prediction approach, which allows us to classify traffic incidents as short-term or long-term. We find the optimal threshold between short-term versus long-term traffic incident duration, targeting both class balance and prediction performance while also comparing the binary versus multi-class classification approaches. Secondly, for more granularity of the incident duration prediction to the minute level, we propose a new Intra-Extra Joint Optimisation algorithm (IEO-ML) which extends multiple baseline ML models tested against several regression scenarios across the data sets. Final results indicate that: a) 40-45 min is the best split threshold for identifying short versus long-term incidents and that these incidents should be modelled separately, b) our proposed IEO-ML approach significantly outperforms baseline ML models in $66\%$ of all cases showcasing its great potential for accurate incident duration prediction. Lastly, we evaluate the feature importance and show that time, location, incident type, incident reporting source and weather are among the top 10 critical factors which influence how long incidents will last.
    A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning. (arXiv:2205.05212v1 [cs.LG])
While reinforcement learning (RL) provides a framework for learning through trial and error, translating RL algorithms into the real world has remained challenging. A major hurdle to real-world application arises from the development of algorithms in an episodic setting where the environment is reset after every trial, in contrast with the continual and non-episodic nature of the real-world encountered by embodied agents such as humans and robots. Prior works have considered an alternating approach where a forward policy learns to solve the task and the backward policy learns to reset the environment, but what initial state distribution should the backward policy reset the agent to? Assuming access to a few demonstrations, we propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. This keeps the agent close to the task-relevant states, allowing for a mix of easy and difficult starting states for the forward policy. Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks from the EARL benchmark, with 40% gains on the hardest task, while making fewer assumptions than prior works.
    To SMOTE, or not to SMOTE?. (arXiv:2201.08528v3 [cs.LG] UPDATED)
Balancing the data before training a classifier is a popular technique to address the challenges of imbalanced binary classification in tabular data. Balancing is commonly achieved by duplication of minority samples or by generation of synthetic minority samples. While it is well known that balancing affects each classifier differently, most prior empirical studies did not include strong state-of-the-art (SOTA) classifiers as baselines. In this work, we are interested in understanding whether balancing is beneficial, particularly in the context of SOTA classifiers. Thus, we conduct extensive experiments considering three SOTA classifiers alongside the weaker learners used in previous investigations. Additionally, we carefully discern proper metrics, consistent and non-consistent algorithms and hyper-parameter selection methods and show that these have a significant impact on prediction quality and on the effectiveness of balancing. Our results support the known utility of balancing for weak classifiers. However, we find that balancing does not improve prediction performance for the strong ones. We further identify several other scenarios for which balancing is effective and observe that prior studies demonstrated the utility of balancing by focusing on these settings.
    A Continual Deepfake Detection Benchmark: Dataset, Methods, and Essentials. (arXiv:2205.05467v1 [cs.CV])
A number of benchmarks and techniques have been emerging for the detection of deepfakes. However, very few works study the detection of incrementally appearing deepfakes in real-world scenarios. To simulate wild scenes, this paper suggests a continual deepfake detection benchmark (CDDB) over a new collection of deepfakes from both known and unknown generative models. The suggested CDDB designs multiple evaluations of detection over easy, hard, and long sequences of deepfake tasks, with a set of appropriate measures. In addition, we exploit multiple approaches to adapt multiclass incremental learning methods, commonly used in continual visual recognition, to the continual deepfake detection problem. We evaluate several methods, including the adapted ones, on the proposed CDDB. Within the proposed benchmark, we explore some commonly known essentials of standard continual learning. Our study provides new insights on these essentials in the context of continual deepfake detection. The suggested CDDB is clearly more challenging than existing benchmarks, and thus offers a suitable evaluation avenue for future research. Our benchmark dataset and the source code will be made publicly available.
    Probability Distribution of Hypervolume Improvement in Bi-objective Bayesian Optimization. (arXiv:2205.05505v1 [cs.LG])
This work provides the exact expression of the probability distribution of the hypervolume improvement (HVI) for the bi-objective generalization of Bayesian optimization. Here, instead of a single-objective improvement, we consider the improvement of the hypervolume indicator with respect to the current best approximation of the Pareto front. Gaussian process regression models are trained independently on both objective functions, resulting in a bi-variate separated Gaussian distribution serving as a predictive model for the vector-valued objective function. Some common HVI-based acquisition functions (probability of improvement and upper confidence bound) are also leveraged with the help of the exact distribution of HVI. In addition, we show the superior numerical accuracy and efficiency of the exact distribution compared to the commonly used approximation by Monte-Carlo sampling. Finally, we benchmark distribution-leveraged acquisition functions on the widely applied ZDT problem set, demonstrating a significant advantage of using the exact distribution of HVI in multi-objective Bayesian optimization.
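For orientation, a minimal sketch of the Monte-Carlo baseline the abstract compares against: expected HVI estimated by sampling from independent Gaussian predictions (2-D minimization; all numbers are toy values, not from the paper):

```python
import numpy as np

def pareto_filter(points, ref):
    """Keep mutually non-dominated points (minimization) inside the reference box."""
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    front, best_y = [], float("inf")
    for x, y in pts:
        if y < best_y:
            front.append((x, y))
            best_y = y
    return front

def hypervolume_2d(points, ref):
    """2-D hypervolume of a point set with respect to reference point ref."""
    hv, prev_y = 0.0, ref[1]
    for x, y in pareto_filter(points, ref):
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv

def mc_expected_hvi(mu, sigma, front, ref, n=20000, seed=0):
    """Monte-Carlo expected HVI under independent Gaussian objective predictions."""
    rng = np.random.default_rng(seed)
    base = hypervolume_2d(front, ref)
    samples = rng.normal(mu, sigma, size=(n, 2))
    return float(np.mean([hypervolume_2d(front + [tuple(s)], ref) - base
                          for s in samples]))

front = [(1.0, 3.0), (2.0, 1.0)]
print(mc_expected_hvi(mu=[1.2, 1.5], sigma=[0.3, 0.3], front=front, ref=(4.0, 4.0)))
```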
    Efficient Risk-Averse Reinforcement Learning. (arXiv:2205.05138v1 [cs.LG])
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent's experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it. We also devise a novel Cross Entropy module for risk sampling, which (1) preserves risk aversion despite the soft risk; (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR - Cross-entropy Soft-Risk optimization algorithm - which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails.
    Developing cooperative policies for multi-stage reinforcement learning tasks. (arXiv:2205.05230v1 [cs.LG])
Many hierarchical reinforcement learning algorithms utilise a series of independent skills as a basis to solve tasks at a higher level of reasoning. These algorithms don't consider the value of using skills that are cooperative instead of independent. This paper proposes the Cooperative Consecutive Policies (CCP) method of enabling consecutive agents to cooperatively solve long time horizon multi-stage tasks. This method is achieved by modifying the policy of each agent to maximise both the current and next agent's critic. Cooperatively maximising critics allows each agent to take actions that are beneficial for its task as well as subsequent tasks. Using this method in a multi-room maze domain and a peg in hole manipulation domain, the cooperative policies were able to outperform a set of naive policies, a single agent trained across the entire domain, as well as another sequential HRL algorithm.
    AutoKE: An automatic knowledge embedding framework for scientific machine learning. (arXiv:2205.05390v1 [cs.LG])
Imposing physical constraints on neural networks as a method of knowledge embedding has achieved great progress in solving physical problems described by governing equations. However, for many engineering problems, governing equations often have complex forms, including complex partial derivatives or stochastic physical fields, which results in significant inconveniences from the perspective of implementation. In this paper, a scientific machine learning framework, called AutoKE, is proposed, and a reservoir flow problem is taken as an instance to demonstrate that this framework can effectively automate the process of embedding physical knowledge. In AutoKE, an emulator comprised of deep neural networks (DNNs) is built for predicting the physical variables of interest. An arbitrarily complex equation can be parsed and automatically converted into a computational graph through the equation parser module, and the fitness of the emulator to the governing equation is evaluated via automatic differentiation. Furthermore, the fixed weights in the loss function are substituted with adaptive weights by incorporating the Lagrangian dual method. Neural architecture search (NAS) is also introduced into the AutoKE to select an optimal network architecture of the emulator according to the specific problem. Finally, we apply transfer learning to enhance the scalability of the emulator. In experiments, the framework is verified by a series of physical problems in which it can automatically embed physical knowledge into an emulator without heavy hand-coding. The results demonstrate that the emulator can not only make accurate predictions, but also be applied to similar problems with high efficiency via transfer learning.
    Robustness of Humans and Machines on Object Recognition with Extreme Image Transformations. (arXiv:2205.05167v1 [cs.CV])
Recent neural network architectures have claimed to explain data from the human visual cortex. Their demonstrated performance is, however, still limited by the dependence on exploiting low-level features for solving visual tasks. This strategy limits their performance in case of out-of-distribution/adversarial data. Humans, meanwhile, learn abstract concepts and are mostly unaffected by even extreme image distortions. Humans and networks employ strikingly different strategies to solve visual tasks. To probe this, we introduce a novel set of image transforms and evaluate humans and networks on an object recognition task. We found that performance for a few common networks quickly decreases, while humans are able to recognize objects with high accuracy.
    AutoTransfer: Subject Transfer Learning with Censored Representations on Biosignals Data. (arXiv:2112.09796v2 [cs.LG] UPDATED)
We provide a regularization framework for subject transfer learning in which we seek to train an encoder and classifier to minimize classification loss, subject to a penalty measuring independence between the latent representation and the subject label. We introduce three notions of independence and corresponding penalty terms using mutual information or divergence as a proxy for independence. For each penalty term, we provide several concrete estimation algorithms, using analytic methods as well as neural critic functions. We provide a hands-off strategy for applying this diverse family of regularization algorithms to a new dataset, which we call "AutoTransfer". We evaluate the performance of these individual regularization strategies and our AutoTransfer method on EEG, EMG, and ECoG datasets, showing that these approaches can improve subject transfer learning for challenging real-world datasets.
    A Machine Learning Analysis of COVID-19 Mental Health Data. (arXiv:2112.00227v2 [cs.LG] UPDATED)
In late December 2019, the novel coronavirus (Sars-Cov-2) and the resulting disease COVID-19 were first identified in Wuhan China. The disease slipped through containment measures, with the first known case in the United States being identified on January 20th, 2020. In this paper, we utilize survey data from the Inter-university Consortium for Political and Social Research and apply several statistical and machine learning models and techniques such as Decision Trees, Multinomial Logistic Regression, Naive Bayes, k-Nearest Neighbors, Support Vector Machines, Neural Networks, Random Forests, Gradient Tree Boosting, XGBoost, CatBoost, LightGBM, Synthetic Minority Oversampling, and Chi-Squared Test to analyze the impacts the COVID-19 pandemic has had on the mental health of frontline workers in the United States. Through the interpretation of the many models applied to the mental health survey data, we have concluded that the most important factor in predicting the mental health decline of a frontline worker is the healthcare role the individual is in (Nurse, Emergency Room Staff, Surgeon, etc.), followed by the amount of sleep the individual has had in the last week, the amount of COVID-19 related news an individual has consumed on average in a day, the age of the worker, and the usage of alcohol and cannabis.
    Multifidelity data fusion in convolutional encoder/decoder networks. (arXiv:2205.05187v1 [cs.LG])
We analyze the regression accuracy of convolutional neural networks assembled from encoders, decoders and skip connections and trained with multifidelity data. Besides requiring significantly less trainable parameters than equivalent fully connected networks, encoder, decoder, encoder-decoder or decoder-encoder architectures can learn the mapping between inputs to outputs of arbitrary dimensionality. We demonstrate their accuracy when trained on a few high-fidelity and many low-fidelity data generated from models ranging from one-dimensional functions to Poisson equation solvers in two-dimensions. We finally discuss a number of implementation choices that improve the reliability of the uncertainty estimates generated by Monte Carlo DropBlocks, and compare uncertainty estimates among low-, high- and multifidelity approaches.
    Unsupervised machine learning for physical concepts. (arXiv:2205.05279v1 [cs.LG])
In recent years, machine learning methods have been used to assist scientists in scientific research. Human scientific theories are based on a series of concepts, so how a machine learns concepts from experimental data is an important first step. We propose a hybrid method to extract interpretable physical concepts through unsupervised machine learning. This method consists of two stages. First, we find the Betti numbers of the experimental data. Second, given the Betti numbers, we use a variational autoencoder network to extract meaningful physical variables. We test our protocol on toy models and show how it works.
    Learning and Evaluating Graph Neural Network Explanations based on Counterfactual and Factual Reasoning. (arXiv:2202.08816v3 [cs.IR] UPDATED)
Structural data abounds in Web applications, such as social networks in social media, citation networks on academic websites, and threads data in online forums. Due to the complex topology, it is difficult to process and make use of the rich information within such data. Graph Neural Networks (GNNs) have shown great advantages in learning representations for structural data. However, the non-transparency of the deep learning models makes it non-trivial to explain and interpret the predictions made by GNNs. Meanwhile, it is also a big challenge to evaluate GNN explanations, since in many cases the ground-truth explanations are unavailable. In this paper, we take insights from Counterfactual and Factual (CF^2) reasoning in causal inference theory, to solve both the learning and evaluation problems in explainable GNNs. For generating explanations, we propose a model-agnostic framework by formulating an optimization problem based on both of the two causal perspectives. This distinguishes CF^2 from previous explainable GNNs that only consider one of them. Another contribution of the work is the evaluation of GNN explanations. For quantitatively evaluating the generated explanations without the requirement of ground truth, we design metrics based on Counterfactual and Factual reasoning to evaluate the necessity and sufficiency of the explanations. Experiments show that whether or not ground-truth explanations are available, CF^2 generates better explanations than previous state-of-the-art methods on real-world datasets. Moreover, the statistical analysis justifies the correlation between the performance on ground-truth evaluation and our proposed metrics. Source code is available at https://github.com/chrisjtan/gnn_cff.
    Self-Supervised Anomaly Detection: A Survey and Outlook. (arXiv:2205.05173v1 [cs.LG])
Over the past few years, anomaly detection, a subfield of machine learning that is mainly concerned with the detection of rare events, witnessed an immense improvement following the unprecedented growth of deep learning models. Recently, the emergence of self-supervised learning has sparked the development of new anomaly detection algorithms that surpassed state-of-the-art accuracy by a significant margin. This paper aims to review the current approaches in self-supervised anomaly detection. We present technical details of the common approaches and discuss their strengths and drawbacks. We also compare the performance of these models against each other and other state-of-the-art anomaly detection models. Finally, we discuss a variety of new directions for improving the existing algorithms.
    Keep Your Friends Close and Your Counterfactuals Closer: Improved Learning From Closest Rather Than Plausible Counterfactual Explanations in an Abstract Setting. (arXiv:2205.05515v1 [cs.AI])
Counterfactual explanations (CFEs) highlight what changes to a model's input would have changed its prediction in a particular way. CFEs have gained considerable traction as a psychologically grounded solution for explainable artificial intelligence (XAI). Recent innovations introduce the notion of computational plausibility for automatically generated CFEs, enhancing their robustness by exclusively creating plausible explanations. However, the practical benefits of such a constraint on user experience and behavior are yet unclear. In this study, we evaluate objective and subjective usability of computationally plausible CFEs in an iterative learning design targeting novice users. We rely on a novel, game-like experimental design, revolving around an abstract scenario. Our results show that novice users actually benefit less from receiving computationally plausible rather than closest CFEs that produce minimal changes leading to the desired outcome. Responses in a post-game survey reveal no differences in terms of subjective user experience between both groups. Following the view of psychological plausibility as comparative similarity, this may be explained by the fact that users in the closest condition experience their CFEs as more psychologically plausible than the computationally plausible counterpart. In sum, our work highlights a little-considered divergence of definitions of computational plausibility and psychological plausibility, critically confirming the need to incorporate human behavior, preferences and mental models already at the design stages of XAI approaches. In the interest of reproducible research, all source code, acquired user data, and evaluation scripts of the current study are available: https://github.com/ukuhl/PlausibleAlienZoo
    ConfLab: A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions In-the-Wild. (arXiv:2205.05177v1 [cs.MM])
We describe an instantiation of a new concept for multimodal multisensor data collection of real life in-the-wild free standing social interactions in the form of a Conference Living Lab (ConfLab). ConfLab contains high fidelity data of 49 people during a real-life professional networking event capturing a diverse mix of status, acquaintanceship, and networking motivations at an international conference. Recording such a dataset is challenging due to the delicate trade-off between participant privacy and fidelity of the data, and the technical and logistic challenges involved. We improve upon prior datasets in the fidelity of most of our modalities: 8-camera overhead setup, personal wearable sensors recording body motion (9-axis IMU), Bluetooth-based proximity, and low-frequency audio. Additionally, we use a state-of-the-art hardware synchronization solution and time-efficient continuous technique for annotating body keypoints and actions at high frequencies. We argue that our improvements are essential for a deeper study of interaction dynamics at finer time scales. Our research tasks showcase some of the open challenges related to in-the-wild privacy-preserving social data analysis: keypoints detection from overhead camera views, skeleton based no-audio speaker detection, and F-formation detection. With the ConfLab dataset, we aim to bridge the gap between traditional computer vision tasks and in-the-wild ecologically valid socially-motivated tasks.
    Exploring Local Explanations of Nonlinear Models Using Animated Linear Projections. (arXiv:2205.05359v1 [stat.ML])
The increased predictive power of nonlinear models comes at the cost of interpretability of their terms. This trade-off has led to the emergence of eXplainable AI (XAI). XAI attempts to shed light on how models use predictors to arrive at a prediction with local explanations, a point estimate of the linear feature importance in the vicinity of one instance. These can be considered linear projections and can be further explored to better understand the interactions between features used to make predictions across the predictive model surface. Here we describe interactive linear interpolation used for exploration at any instance, and illustrate it with examples with categorical (penguin species, chocolate types) and quantitative (soccer/football salaries, house prices) output. The methods are implemented in the R package cheem, available on CRAN.
    An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers. (arXiv:2205.05543v1 [cs.CV])
Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks. The follow-up research adapted similar methods like masked image modeling in vision transformers and demonstrated improvements in the image classification task. Such simple self-supervised methods are not exhaustively studied for object detection transformers (DETR, Deformable DETR) as their transformer encoder modules take input in the convolutional neural network (CNN) extracted feature space rather than the image space as in general vision transformers. However, the CNN feature maps still maintain the spatial relationship and we utilize this property to design self-supervised learning approaches to train the encoder of object detection transformers in pretraining and multi-task learning settings. We explore common self-supervised methods based on image reconstruction, masked image modeling and jigsaw. Preliminary experiments in the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings; nonetheless, similar improvement is not observed in the case of multi-task learning with Deformable DETR. The code for our experiments with DETR and Deformable DETR is available at https://github.com/gokulkarthik/detr and https://github.com/gokulkarthik/Deformable-DETR respectively.
    Detecting Emerging Technologies and their Evolution using Deep Learning and Weak Signal Analysis. (arXiv:2205.05449v1 [cs.AI])
    Emerging technologies can have major economic impacts and affect strategic stability. Yet, early identification of emerging technologies remains challenging. In order to identify emerging technologies in a timely and reliable manner, a comprehensive examination of relevant scientific and technological (S&T) trends and their related references is required. This examination is generally done by domain experts and requires significant amounts of time and effort to gain insights. The use of domain experts to identify emerging technologies from S&T trends may limit the capacity to analyse large volumes of information and introduce subjectivity in the assessments. Decision support systems are required to provide accurate and reliable evidence-based indicators through constant and continuous monitoring of the environment and help identify signals of emerging technologies that could alter security and economic prosperity. For example, the research field of hypersonics has recently witnessed several advancements having profound technological, commercial, and national security implications. In this work, we present a multi-layer quantitative approach able to identify future signs from scientific publications on hypersonics by leveraging deep learning and weak signal analysis. The proposed framework can help strategic planners and domain experts better identify and monitor emerging technology trends.
    A Framework for Machine Learning of Model Error in Dynamical Systems. (arXiv:2107.06658v2 [math.DS] UPDATED)
    The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from noisily and partially observed data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. Our formulation is agnostic to the chosen machine learning model, is presented in both continuous- and discrete-time settings, and is compatible both with model errors that exhibit substantial memory and errors that are memoryless. First, we study memoryless linear (w.r.t. parametric-dependence) model error from a learning theory perspective, defining excess risk and generalization error. For ergodic continuous-time systems, we prove that both excess risk and generalization error are bounded above by terms that diminish with the square-root of T, the time-interval over which training data is specified. Secondly, we study scenarios that benefit from modeling with memory, proving universal approximation theorems for two classes of continuous-time recurrent neural networks (RNNs): both can learn memory-dependent model error. In addition, we connect one class of RNNs to reservoir computing, thereby relating learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features. Numerical results are presented (Lorenz '63, Lorenz '96 Multiscale systems) to compare purely data-driven and hybrid approaches, finding hybrid methods less data-hungry and more parametrically efficient. Finally, we demonstrate numerically how data assimilation can be leveraged to learn hidden dynamics from noisy, partially-observed data, and illustrate challenges in representing memory by this approach, and in the training of such models.
    A globally convergent fast iterative shrinkage-thresholding algorithm with a new momentum factor for single and multi-objective convex optimization. (arXiv:2205.05262v1 [math.OC])
    Convex-composite optimization, which minimizes an objective function represented by the sum of a differentiable function and a convex one, is widely used in machine learning and signal/image processing. The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is a typical method for solving this problem and has a global convergence rate of $O(1 / k^2)$. Recently, FISTA has been extended to multi-objective optimization, together with a proof of the $O(1 / k^2)$ global convergence rate. However, its momentum factor is classical, and the convergence of its iterates has not been proven. In this work, introducing some additional hyperparameters $(a, b)$, we propose another accelerated proximal gradient method with a general momentum factor, which is new even for the single-objective case. We show that our proposed method also has a global convergence rate of $O(1/k^2)$ for any $(a,b)$, and further that the generated sequence of iterates converges to a weak Pareto solution when $a$ is positive, an essential property for finite-time manifold identification. Moreover, we report numerical results with various $(a,b)$, showing that some of these choices give better results than the classical momentum factors.
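    For reference, here is a minimal single-objective FISTA for the lasso with the classical momentum factor $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$; the paper's method replaces this factor with a generalized one parameterized by $(a, b)$, which we do not reproduce here.

        import numpy as np

        def fista(A, b, lam, iters=500):
            # minimize 0.5*||Ax - b||^2 + lam*||x||_1
            L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the smooth part
            x = y = np.zeros(A.shape[1]); t = 1.0
            for _ in range(iters):
                z = y - A.T @ (A @ y - b) / L             # gradient step on the smooth part
                x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
                t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2  # classical momentum factor
                y = x_new + (t - 1) / t_new * (x_new - x)
                x, t = x_new, t_new
            return x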
    Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. (arXiv:2205.05628v1 [cs.CR])
    With the increasing prevalence of encrypted network traffic, cyber security analysts have been turning to machine learning (ML) techniques to elucidate the traffic on their networks. However, ML models can become stale as known traffic features can shift between networks and as new traffic emerges that is outside of the distribution of the training set. In order to reliably adapt in this dynamic environment, ML models must additionally provide contextualized uncertainty quantification to their predictions, which has received little attention in the cyber security domain. Uncertainty quantification is necessary both to signal when the model is uncertain about which class to choose in its label assignment and when the traffic is not likely to belong to any pre-trained classes. We present a new, public dataset of network traffic that includes labeled, Virtual Private Network (VPN)-encrypted network traffic generated by 10 applications and corresponding to 5 application categories. We also present an ML framework that is designed to rapidly train with modest data requirements and provide both calibrated, predictive probabilities as well as an interpretable ``out-of-distribution'' (OOD) score to flag novel traffic samples. We describe how to compute a calibrated OOD score from p-values of the so-called relative Mahalanobis distance. We demonstrate that our framework achieves an F1 score of 0.98 on our dataset and that it can extend to an enterprise network by testing the model: (1) on data from similar applications, (2) on dissimilar application traffic from an existing category, and (3) on application traffic from a new category. The model correctly flags uncertain traffic and, upon retraining, accurately incorporates the new data. We additionally demonstrate good performance (F1 score of 0.97) when packet sizes are made to be uniform, as occurs for certain encryption protocols.
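    A common formulation of the relative Mahalanobis distance (one plausible reading on our part; the paper's calibrated OOD score is derived from p-values of this quantity, a step omitted here) subtracts a "background" Gaussian distance from the minimum class-conditional distance:

        import numpy as np

        def rmd_score(x, class_means, cov_inv, bg_mean, bg_cov_inv):
            # a class distance that is small relative to the background distance
            # indicates in-distribution traffic; larger values flag OOD samples
            d_cls = min(float((x - m) @ cov_inv @ (x - m)) for m in class_means)
            d_bg = float((x - bg_mean) @ bg_cov_inv @ (x - bg_mean))
            return d_cls - d_bg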
    Automatic Tuberculosis and COVID-19 cough classification using deep learning. (arXiv:2205.05480v1 [cs.LG])
    We present a deep learning based automatic cough classifier which can discriminate tuberculosis (TB) coughs from COVID-19 coughs and healthy coughs. Both TB and COVID-19 are respiratory diseases, have cough as a predominant symptom, and claim thousands of lives each year. The cough audio recordings were collected in both indoor and outdoor settings and were also uploaded using smartphones by subjects around the globe, and thus contain various levels of noise. This cough dataset includes 1.68 hours of TB coughs, 18.54 minutes of COVID-19 coughs and 1.69 hours of healthy coughs from 47 TB patients, 229 COVID-19 patients and 1498 healthy subjects, and was used to train and evaluate a CNN, an LSTM and a Resnet50. These three deep architectures were also pre-trained on 2.14 hours of sneezes, 2.91 hours of speech and 2.79 hours of noise for improved performance. The class imbalance in our dataset was addressed by using the SMOTE data balancing technique and by using performance metrics such as the F1-score and AUC. Our study shows that the highest F1-scores of 0.9259 and 0.8631 were achieved by a pre-trained Resnet50 for the two-class (TB vs COVID-19) and three-class (TB vs COVID-19 vs healthy) cough classification tasks, respectively. The application of deep transfer learning has improved the classifiers' performance and makes them more robust, as they generalise better over the cross-validation folds. Their performance exceeds the TB triage test requirements set by the World Health Organisation (WHO). The features producing the best performance include higher-order MFCCs, suggesting that the differences between TB and COVID-19 coughs are not perceivable by the human ear. This type of cough audio classification is non-contact, cost-effective and can easily be deployed on a smartphone, making it an excellent tool for both TB and COVID-19 screening.
    FairNeuron: Improving Deep Neural Network Fairness with Adversary Games on Selective Neurons. (arXiv:2204.02567v2 [cs.LG] UPDATED)
    With Deep Neural Networks (DNNs) being integrated into a growing number of critical systems with far-reaching impacts on society, there are increasing concerns about their ethical performance, such as fairness. Unfortunately, model fairness and accuracy are in many cases contradictory goals to optimize. To solve this issue, a number of works have tried to improve model fairness by using an adversarial game at the model level. This approach introduces an adversary that evaluates the fairness of a model besides its prediction accuracy on the main task, and performs joint optimization to achieve a balanced result. In this paper, we observe that when performing backpropagation-based training, such a contradictory phenomenon also appears at the individual neuron level. Based on this observation, we propose FairNeuron, an automatic DNN model repair tool, to mitigate fairness concerns and balance the accuracy-fairness trade-off without introducing another model. It works by detecting neurons with contradictory optimization directions from the accuracy and fairness training goals, and achieving a trade-off via selective dropout. Compared with state-of-the-art methods, our approach is lightweight, making it scalable and more efficient. Our evaluation on 3 datasets shows that FairNeuron can effectively improve all models' fairness while maintaining a stable utility.
    Subspace Learning Machine (SLM): Methodology and Performance. (arXiv:2205.05296v1 [cs.LG])
    Inspired by the feedforward multilayer perceptron (FF-MLP), decision tree (DT) and extreme learning machine (ELM), a new classification model, called the subspace learning machine (SLM), is proposed in this work. SLM first identifies a discriminant subspace, $S^0$, by examining the discriminant power of each input feature. Then, it uses probabilistic projections of features in $S^0$ to yield 1D subspaces and finds the optimal partition for each of them. This is equivalent to partitioning $S^0$ with hyperplanes. A criterion is developed to choose the best $q$ partitions that yield $2q$ partitioned subspaces among them. We assign $S^0$ to the root node of a decision tree and the intersections of $2q$ subspaces to its child nodes of depth one. The partitioning process is recursively applied at each child node to build an SLM tree. When the samples at a child node are sufficiently pure, the partitioning process stops and each leaf node makes a prediction. The idea can be generalized to regression, leading to the subspace learning regressor (SLR). Furthermore, ensembles of SLM/SLR trees can yield a stronger predictor. Extensive experiments are conducted for performance benchmarking among SLM/SLR trees, ensembles and classical classifiers/regressors.
    Robust Data-Driven Output Feedback Control via Bootstrapped Multiplicative Noise. (arXiv:2205.05119v1 [eess.SY])
    We propose a robust data-driven output feedback control algorithm that explicitly incorporates inherent finite-sample model estimate uncertainties into the control design. The algorithm has three components: (1) a subspace identification nominal model estimator; (2) a bootstrap resampling method that quantifies non-asymptotic variance of the nominal model estimate; and (3) a non-conventional robust control design method comprising a coupled optimal dynamic output feedback filter and controller with multiplicative noise. A key advantage of the proposed approach is that the system identification and robust control design procedures both use stochastic uncertainty representations, so that the actual inherent statistical estimation uncertainty directly aligns with the uncertainty the robust controller is being designed against. Moreover, the control design method accommodates a highly structured uncertainty representation that can capture uncertainty shape more effectively than existing approaches. We show through numerical experiments that the proposed robust data-driven output feedback controller can significantly outperform a certainty equivalent controller on various measures of sample complexity and stability robustness.
    Ranked Prioritization of Groups in Combinatorial Bandit Allocation. (arXiv:2205.05659v1 [cs.AI])
    Preventing poaching through ranger patrols protects endangered wildlife, directly contributing to the UN Sustainable Development Goal 15 of life on land. Combinatorial bandits have been used to allocate limited patrol resources, but existing approaches overlook the fact that each location is home to multiple species in varying proportions, so a patrol benefits each species to differing degrees. When some species are more vulnerable, we ought to offer more protection to these animals; unfortunately, existing combinatorial bandit approaches do not offer a way to prioritize important species. To bridge this gap, (1) we propose a novel combinatorial bandit objective that trades off reward maximization against prioritization over species, which we call ranked prioritization. We show this objective can be expressed as a weighted linear sum of Lipschitz-continuous reward functions. (2) We provide RankedCUCB, an algorithm to select combinatorial actions that optimize our prioritization-based objective, and prove that it achieves asymptotic no-regret. (3) We demonstrate empirically that RankedCUCB leads to up to 38% improvement in outcomes for endangered species using real-world wildlife conservation data. Along with adapting to other challenges such as preventing illegal logging and overfishing, our no-regret algorithm addresses the general combinatorial bandit problem with a weighted linear objective.
    Stable and Interpretable Unrolled Dictionary Learning. (arXiv:2106.00058v4 [cs.LG] UPDATED)
    The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a ``good'' initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the first theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent sparse representation. Additionally, we address two challenges: first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE's interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between the network weights, its output, and the training set.
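    As a rough illustration of the unrolled sparse coding setup studied here (a sketch under our own assumptions, not PUDLE itself), T ISTA iterations with a learnable dictionary are unrolled so that backpropagating a reconstruction loss updates the dictionary:

        import torch
        import torch.nn as nn

        class UnrolledISTA(nn.Module):
            def __init__(self, n, m, T=20, lam=0.1, step=0.1):
                super().__init__()
                self.W = nn.Parameter(torch.randn(n, m) / n ** 0.5)  # dictionary
                self.T, self.lam, self.step = T, lam, step

            def forward(self, y):                       # y: (batch, n)
                z = torch.zeros(y.shape[0], self.W.shape[1], device=y.device)
                for _ in range(self.T):                 # unrolled ISTA iterations
                    z = z + self.step * (y - z @ self.W.T) @ self.W
                    z = torch.sign(z) * torch.relu(z.abs() - self.step * self.lam)
                return z @ self.W.T                     # reconstruction for the training loss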
    Analysis of convolutional neural network image classifiers in a rotationally symmetric model. (arXiv:2205.05500v1 [stat.ML])
    Convolutional neural network image classifiers are defined and the rate of convergence of the misclassification risk of the estimates towards the optimal misclassification risk is analyzed. Here we consider images as random variables with values in some functional space, where we only observe discrete samples as function values on some finite grid. Under suitable structural and smoothness assumptions on the functional a posteriori probability, which includes some kind of symmetry against rotation of subparts of the input image, it is shown that least squares plug-in classifiers based on convolutional neural networks are able to circumvent the curse of dimensionality in binary image classification if we neglect a resolution-dependent error term. The finite sample size behavior of the classifier is analyzed by applying it to simulated and real data.
    Contextual Search in the Presence of Adversarial Corruptions. (arXiv:2002.11650v5 [cs.LG] UPDATED)
    We study contextual search, a generalization of binary search in higher dimensions, which captures settings such as feature-based dynamic pricing. Standard formulations of this problem assume that agents act in accordance with a specific homogeneous response model. In practice, however, some responses may be adversarially corrupted. Existing algorithms heavily depend on the assumed response model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrary misspecifications. We initiate the study of contextual search when some of the agents can behave in ways inconsistent with the underlying response model. In particular, we provide two algorithms, one based on multidimensional binary search methods and one based on gradient descent. We show that these algorithms attain near-optimal regret in the absence of adversarial corruptions and their performance degrades gracefully with the number of such agents, providing the first results for contextual search in any adversarial noise model. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis.
    Autoencoder Attractors for Uncertainty Estimation. (arXiv:2204.00382v2 [cs.LG] UPDATED)
    The reliability of a machine learning model's predictions is an important quantity for deployment in safety-critical applications. Not only can it be used to detect novel scenarios, either as out-of-distribution or anomalous samples, but it also helps to determine deficiencies in the training data distribution. Promising research directions have either proposed traditional methods like Gaussian processes or extended deep learning based approaches, for example, by interpreting them from a Bayesian point of view. In this work we propose a novel approach for uncertainty estimation based on autoencoder models: the recursive application of a previously trained autoencoder model can be interpreted as a dynamical system storing training examples as attractors. While input images close to known samples will converge to the same or a similar attractor, input samples containing unknown features are unstable and converge to different training samples, potentially removing or changing characteristic features. The use of dropout during training and inference leads to a family of similar dynamical systems, each one being robust on samples close to the training distribution but unstable on new features. Either the model reliably removes these features or the resulting instability can be exploited to detect problematic input samples. We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior, for which we additionally release a new synthetic dataset.
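    The recursive application is straightforward to sketch (our illustration; `ae` stands for any trained autoencoder module, and the residual norm after iteration serves as a simple instability score):

        import torch

        def attractor_instability(ae, x, steps=10):
            with torch.no_grad():
                cur = x
                for _ in range(steps):
                    cur = ae(cur)                 # iterate the autoencoder as a dynamical system
                # residual after iteration: small near attractors, large for unstable inputs
                return (cur - ae(cur)).flatten(1).norm(dim=1)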
    Accelerated Reinforcement Learning for Temporal Logic Control Objectives. (arXiv:2205.04424v2 [cs.RO] UPDATED)
    This paper addresses the problem of learning control policies for mobile robots modeled as unknown Markov Decision Processes (MDPs) that are tasked with temporal logic missions, such as sequencing, coverage, or surveillance. The MDP captures uncertainty in the workspace structure and the outcomes of control decisions. The control objective is to synthesize a control policy that maximizes the probability of accomplishing a high-level task, specified as a Linear Temporal Logic (LTL) formula. To address this problem, we propose a novel accelerated model-based reinforcement learning (RL) algorithm for LTL control objectives that is capable of learning control policies significantly faster than related approaches. Its sample-efficiency relies on biasing exploration towards directions that may contribute to task satisfaction. This is accomplished by leveraging an automaton representation of the LTL task as well as a continuously learned MDP model. Finally, we provide extensive comparative experiments that demonstrate the sample efficiency of the proposed method against recent temporal logic RL methods.
    Machine Learning to Support Triage of Children at Risk for Epileptic Seizures in the Pediatric Intensive Care Unit. (arXiv:2205.05389v1 [cs.LG])
    Objective: Epileptic seizures are relatively common in critically-ill children admitted to the pediatric intensive care unit (PICU) and thus serve as an important target for identification and treatment. Most of these seizures have no discernible clinical manifestation but still have a significant impact on morbidity and mortality. Children that are deemed at risk for seizures within the PICU are monitored using continuous electroencephalography (cEEG). cEEG monitoring cost is considerable, and as the number of available machines is always limited, clinicians need to resort to triaging patients according to perceived risk in order to allocate resources. This research aims to develop a computer-aided tool to improve seizure risk assessment in critically-ill children, using a ubiquitously recorded signal in the PICU, namely the electrocardiogram (ECG). Approach: A novel data-driven model was developed using a patient-level approach, based on features extracted from the first hour of ECG recording and the clinical data of the patient. Main results: The most predictive features were the age of the patient, brain injury as the coma etiology, and the QRS area. For patients without any prior clinical data, using one hour of ECG recording, the classification performance of the random forest classifier reached an area under the receiver operating characteristic curve (AUROC) score of 0.84. When combining ECG features with the patient's clinical history, the AUROC reached 0.87. Significance: Considering a realistic clinical scenario, we estimated that our clinical decision support triage tool can improve the positive predictive value by more than 59% over the clinical standard.
    An Inexact Augmented Lagrangian Algorithm for Training Leaky ReLU Neural Network with Group Sparsity. (arXiv:2205.05428v1 [math.OC])
    The leaky ReLU network with a group sparse regularization term has been widely used in recent years. However, training such a network yields a nonsmooth nonconvex optimization problem, and there is a lack of approaches to compute a stationary point deterministically. In this paper, we first resolve the multi-layer composite term in the original optimization problem by introducing auxiliary variables and additional constraints. We show the new model has a nonempty and bounded solution set and that its feasible set satisfies the Mangasarian-Fromovitz constraint qualification. Moreover, we show the relationship between the new model and the original problem. We then propose an inexact augmented Lagrangian algorithm for solving the new model and show the convergence of the algorithm to a KKT point. Numerical experiments demonstrate that our algorithm is more efficient for training sparse leaky ReLU neural networks than some well-known algorithms.
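    For readers unfamiliar with the regularizer, the group sparse (group-lasso) term is simply a sum of $\ell_2$ norms over parameter groups, which drives whole groups (e.g., all weights of a neuron) to zero. A minimal sketch of the penalty (illustrative only; the paper's algorithm handles this term via auxiliary variables rather than subgradients):

        import torch

        def group_sparse_penalty(weight, dim=1):
            # sum_g ||W_g||_2 with groups taken as rows of the weight matrix
            return weight.norm(dim=dim).sum()

        # e.g., total loss = task_loss + lam * sum(group_sparse_penalty(layer.weight)
        #                                          for layer in linear_layers)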
    Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models. (arXiv:2112.00029v2 [cs.LG] UPDATED)
    Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices is difficult and expensive. To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices. As butterfly matrices are not hardware efficient, we propose simple variants of butterfly (block and flat) to take advantage of modern hardware. Our method (Pixelated Butterfly) uses a simple fixed sparsity pattern based on flat block butterfly and low-rank matrices to sparsify most network layers (e.g., attention, MLP). We empirically validate that Pixelated Butterfly is 3x faster than butterfly and speeds up training to achieve favorable accuracy--efficiency tradeoffs. On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium with no drop in accuracy.
    Hitting time for Markov decision process. (arXiv:2205.03476v2 [cs.LG] UPDATED)
    We define the hitting time for a Markov decision process (MDP). We do not use the hitting time of the Markov process induced by the MDP because the induced chain may not have a stationary distribution. Even if it has one, the stationary distribution may not coincide with the (normalized) occupancy measure of the MDP. We observe a relationship between the MDP and PageRank. Using this observation, we construct a Markov process (MP) whose stationary distribution coincides with the normalized occupancy measure of the MDP, and we define the hitting time of the MDP as the hitting time of the associated MP.
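    One concrete reading of the PageRank connection (our sketch; the construction details are assumed rather than taken from the paper): for a policy-induced transition matrix $P$ and start distribution $\mu_0$, the restart chain $P' = (1-\gamma)\mathbf{1}\mu_0^\top + \gamma P$ has a stationary distribution equal to the normalized discounted occupancy measure.

        import numpy as np

        def restart_chain_stationary(P, mu0, gamma=0.9):
            n = P.shape[0]
            P_restart = (1 - gamma) * np.outer(np.ones(n), mu0) + gamma * P
            vals, vecs = np.linalg.eig(P_restart.T)       # stationary dist = left Perron vector
            pi = np.real(vecs[:, np.argmax(np.real(vals))])
            return pi / pi.sum()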
    Reducing Activation Recomputation in Large Transformer Models. (arXiv:2205.05198v1 [cs.LG])
    Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate the training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints: rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
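    To illustrate the selective idea in miniature (a sketch using PyTorch's generic checkpoint utility, not the paper's Megatron-LM implementation, which also adds sequence parallelism): only the memory-heavy submodule is recomputed in the backward pass, while cheap-to-store activations are kept.

        import torch
        from torch.utils.checkpoint import checkpoint

        class Block(torch.nn.Module):
            def __init__(self, attn, mlp):
                super().__init__()
                self.attn, self.mlp = attn, mlp

            def forward(self, x):
                # recompute only the attention activations during backward
                x = x + checkpoint(self.attn, x, use_reentrant=False)
                return x + self.mlp(x)        # MLP activations are stored as usual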
    Pre-trained Language Models as Re-Annotators. (arXiv:2205.05368v1 [cs.CL])
    Annotation noise is widespread in datasets, but manually revising a flawed corpus is time-consuming and error-prone. Hence, given the prior knowledge in Pre-trained Language Models and the expected uniformity across all annotations, we attempt to automatically reduce annotation noise in the corpus through two tasks: (1) Annotation Inconsistency Detection, which indicates the credibility of annotations, and (2) Annotation Error Correction, which rectifies abnormal annotations. We investigate how to acquire semantically sensitive annotation representations from Pre-trained Language Models, expecting to embed examples with identical annotations in mutually adjacent positions even without fine-tuning. We propose a novel credibility score to reveal the likelihood of annotation inconsistencies based on neighbouring consistency. Then, we fine-tune Pre-trained Language Model based classifiers with cross-validation for annotation correction. The annotation corrector is further elaborated with two approaches: (1) soft labelling by Kernel Density Estimation and (2) a novel distant-peer contrastive loss. We study re-annotation in relation extraction and create a new manually revised dataset, Re-DocRED, for evaluating document-level re-annotation. The proposed credibility scores show promising agreement with human revisions, achieving a binary F1 of 93.4 and 72.5 in detecting inconsistencies on TACRED and DocRED respectively. Moreover, the neighbour-aware classifiers based on distant-peer contrastive learning and uncertain labels achieve a Macro F1 of up to 66.2 and 57.8 in correcting annotations on TACRED and DocRED respectively. These improvements are not merely theoretical: automatically denoised training sets yield up to a 3.6% performance improvement for state-of-the-art relation extraction models.
    The First Optimal Algorithm for Smooth and Strongly-Convex-Strongly-Concave Minimax Optimization. (arXiv:2205.05653v1 [math.OC])
    In this paper, we revisit the smooth and strongly-convex-strongly-concave minimax optimization problem. Zhang et al. (2021) and Ibrahim et al. (2020) established the lower bound $\Omega\left(\sqrt{\kappa_x\kappa_y} \log \frac{1}{\epsilon}\right)$ on the number of gradient evaluations required to find an $\epsilon$-accurate solution, where $\kappa_x$ and $\kappa_y$ are condition numbers for the strong convexity and strong concavity assumptions. However, the existing state-of-the-art methods do not match this lower bound: algorithms of Lin et al. (2020) and Wang and Li (2020) have gradient evaluation complexity $\mathcal{O}\left( \sqrt{\kappa_x\kappa_y}\log^3\frac{1}{\epsilon}\right)$ and $\mathcal{O}\left( \sqrt{\kappa_x\kappa_y}\log^3 (\kappa_x\kappa_y)\log\frac{1}{\epsilon}\right)$, respectively. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\sqrt{\kappa_x\kappa_y}\log\frac{1}{\epsilon}\right)$ gradient evaluation complexity. We design our algorithm in three steps: (i) we reformulate the original problem as a minimization problem via the pointwise conjugate function; (ii) we apply a specific variant of the proximal point algorithm to the reformulated problem; (iii) we compute the proximal operator inexactly using the optimal algorithm for operator norm reduction in monotone inclusions.
    Truncated Emphatic Temporal Difference Methods for Prediction and Control. (arXiv:2108.05338v2 [cs.LG] UPDATED)
    Emphatic Temporal Difference (TD) methods are a class of off-policy Reinforcement Learning (RL) methods involving the use of followon traces. Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems. First, followon traces typically suffer from large variance, making them hard to use in practice. Second, though Yu (2015) confirms the asymptotic convergence of some emphatic TD methods for prediction problems, there is still no finite sample analysis for any emphatic TD method for prediction, much less control. In this paper, we address those two open problems simultaneously by using truncated followon traces in emphatic TD methods. Unlike the original followon traces, which depend on all previous history, truncated followon traces depend only on a finite history, reducing variance and enabling the finite sample analysis of our proposed emphatic TD methods for both prediction and control.
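    A sketch of the truncation (our illustration; the full trace follows the recursion $F_t = \gamma \rho_{t-1} F_{t-1} + i_t$, and truncation keeps only the last $n$ terms of the corresponding sum, bounding its magnitude and variance):

        from collections import deque

        def truncated_followon(interests, rhos, gamma, n):
            terms = deque(maxlen=n)       # the n most recent terms of the followon sum
            out = []
            for t, i_t in enumerate(interests):
                if t > 0:
                    decay = gamma * rhos[t - 1]
                    terms = deque((decay * w for w in terms), maxlen=n)
                terms.append(i_t)
                out.append(sum(terms))    # truncated F_t
            return out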
    Scream Detection in Heavy Metal Music. (arXiv:2205.05580v1 [cs.SD])
    Harsh vocal effects such as screams or growls are far more common in heavy metal vocals than traditionally sung vocals. This paper explores the problem of detection and classification of extreme vocal techniques in heavy metal music, specifically the identification of different scream techniques. We investigate the suitability of various feature representations, including cepstral, spectral, and temporal features, as input representations for classification. The main contributions of this work are (i) a manually annotated dataset comprised of over 280 minutes of heavy metal songs of various genres, with a statistical analysis of occurrences of different extreme vocal techniques in heavy metal music, and (ii) a systematic study of different input feature representations for the classification of heavy metal vocals.
    A Survey on Fairness for Machine Learning on Graphs. (arXiv:2205.05396v1 [cs.LG])
    Nowadays, the analysis of complex phenomena modeled by graphs plays a crucial role in many real-world application domains where decisions can have a strong societal impact. However, numerous studies and papers have recently revealed that machine learning models could lead to potential disparate treatment between individuals and unfair outcomes. In that context, algorithmic contributions for graph mining are not spared by the problem of fairness and present some specific challenges related to the intrinsic nature of graphs: (1) graph data is non-IID, and this assumption may invalidate many existing studies in fair machine learning; (2) suitable metrics are needed to assess the different types of fairness with relational data; and (3) finding a good trade-off between model accuracy and fairness is algorithmically challenging. This survey is the first one dedicated to fairness for relational data. It aims to present a comprehensive review of state-of-the-art techniques in fairness on graph mining and to identify the open challenges and future trends. In particular, we start by presenting several sensitive application domains and the associated graph mining tasks, with a focus on edge prediction and node classification in the sequel. We also recall the different metrics proposed to evaluate potential bias at different levels of the graph mining process; then we provide a comprehensive overview of recent contributions in the domain of fair machine learning for graphs, which we classify into pre-processing, in-processing and post-processing models. We also describe existing graph data, synthetic and real-world benchmarks. Finally, we present in detail five promising directions to advance research in studying algorithmic fairness on graphs.
    Symphony Generation with Permutation Invariant Language Model. (arXiv:2205.05448v1 [cs.SD])
    In this work, we present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model. To bridge the gap between text generation and the symphony generation task, we propose a novel Multi-track Multi-instrument Repeatable (MMR) representation with a particular 3-D positional embedding and a modified Byte Pair Encoding algorithm (Music BPE) for music tokens. A novel linear transformer decoder architecture is introduced as a backbone for modeling extra-long sequences of symphony tokens. Meanwhile, we train the decoder to learn automatic orchestration as a joint task by masking instrument information from the input. We also introduce a large-scale symbolic symphony dataset to advance symphony generation research. Our empirical results show that our proposed approach can generate coherent, novel, complex and harmonious symphonies compared to human compositions, a pioneering solution for multi-track multi-instrument symbolic music generation.
    Quantification of Actual Road User Behavior on the Basis of Given Traffic Rules. (arXiv:2202.09269v3 [cs.RO] UPDATED)
    Driving on roads is restricted by various traffic rules, aiming to ensure safety for all traffic participants. However, human road users usually do not adhere to these rules strictly, resulting in varying degrees of rule conformity. Such deviations from given rules are key components of today's road traffic. In autonomous driving, robotic agents can disturb traffic flow, when rule deviations are not taken into account. In this paper, we present an approach to derive the distribution of degrees of rule conformity from human driving data. We demonstrate our method with the Waymo Open Motion dataset and Safety Distance and Speed Limit rules.
    Re-evaluating Word Mover's Distance. (arXiv:2105.14403v2 [cs.LG] UPDATED)
    The word mover's distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it can take advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performances of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ an appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also its distance values resemble those of BOW in high-dimensional spaces.
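    The L1-normalized BOW baseline in question takes only a few lines of scikit-learn (a minimal sketch with a toy document pair of our own choosing):

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.metrics.pairwise import euclidean_distances
        from sklearn.preprocessing import normalize

        docs = ["obama speaks to the media in illinois",
                "the president greets the press in chicago"]
        X = CountVectorizer().fit_transform(docs)
        X = normalize(X, norm="l1")                 # each document's counts sum to one
        print(euclidean_distances(X[0], X[1]))      # distance between the two documents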
    A Ubiquitous Unifying Degeneracy in Two-Body Microlensing Systems. (arXiv:2111.13696v2 [astro-ph.EP] UPDATED)
    While gravitational microlensing by planetary systems provides unique vistas on the properties of exoplanets, observations of a given 2-body microlensing event can often be interpreted with multiple distinct physical configurations. Such ambiguities are typically attributed to the close-wide and inner-outer types of degeneracies that arise from transformation invariances and symmetries of microlensing caustics. However, there remain unexplained inconsistencies between aforementioned theories and observations. Here, leveraging a fast machine learning inference framework, we present the discovery of the offset degeneracy, which concerns a magnification-matching behaviour on the lens-axis and is formulated independent of caustics. This offset degeneracy unifies the close-wide and inner-outer degeneracies, generalises to resonant topologies, and upon reanalysis, not only appears ubiquitous in previously published planetary events with 2-fold degenerate solutions, but also resolves prior inconsistencies. Our analysis demonstrates that degenerate caustics do not strictly result in degenerate magnifications and that the commonly invoked close-wide degeneracy essentially never arises in actual events. Moreover, it is shown that parameters in offset degenerate configurations are related by a simple expression. This suggests the existence of a deeper symmetry in the equations governing 2-body lenses than previously recognised.
    Influence-Driven Data Poisoning in Graph-Based Semi-Supervised Classifiers. (arXiv:2012.07381v2 [cs.LG] UPDATED)
    Graph-based Semi-Supervised Learning (GSSL) is a practical solution to learn from a limited amount of labelled data together with a vast amount of unlabelled data. However, due to their reliance on the known labels to infer the unknown labels, these algorithms are sensitive to data quality. It is therefore essential to study the potential threats related to the labelled data, more specifically, label poisoning. In this paper, we propose a novel data poisoning method which efficiently approximates the result of label inference to identify the inputs which, if poisoned, would produce the highest number of incorrectly inferred labels. We extensively evaluate our approach on three classification problems under 24 different experimental settings each. Compared to the state of the art, our influence-driven attack increases the error rate by 50\% more on average, while being faster by multiple orders of magnitude. Moreover, our method can inform engineers of inputs that deserve investigation (relabelling them) before training the learning model. We show that relabelling one-third of the poisoned inputs (selected based on their influence) reduces the poisoning effect by 50\%.
    Secure Federated Learning for Neuroimaging. (arXiv:2205.05249v1 [cs.LG])
    The amount of biomedical data continues to grow rapidly. However, the ability to collect data from multiple sites for joint analysis remains challenging due to security, privacy, and regulatory concerns. We present a Secure Federated Learning architecture, MetisFL, which enables distributed training of neural networks over multiple data sources without sharing data. Each site trains the neural network over its private data for some time, then shares the neural network parameters (i.e., weights, gradients) with a Federation Controller, which in turn aggregates the local models, sends the resulting community model back to each site, and the process repeats. Our architecture provides strong security and privacy. First, sample data never leaves a site. Second, neural parameters are encrypted before transmission and the community model is computed under fully-homomorphic encryption. Finally, we use information-theoretic methods to limit information leakage from the neural model to prevent a curious site from performing membership attacks. We demonstrate this architecture in neuroimaging. Specifically, we investigate training neural models to classify Alzheimer's disease, and estimate Brain Age, from magnetic resonance imaging datasets distributed across multiple sites, including heterogeneous environments where sites have different amounts of data, statistical distributions, and computational capabilities.
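    The aggregation step in such a federation round can be sketched generically (this is not the MetisFL API, and encryption plus leakage control are omitted; sites report parameter dictionaries and sample counts):

        import torch

        def aggregate(site_states, site_sizes):
            # weighted average of site parameters, weighted by local sample counts
            total = sum(site_sizes)
            return {k: sum(state[k] * (n / total)
                           for state, n in zip(site_states, site_sizes))
                    for k in site_states[0]}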
    Theory and Implementation of Process and Temperature Scalable Shape-based CMOS Analog Circuits. (arXiv:2205.05664v1 [cs.AR])
    Analog computing is attractive compared to its digital counterparts due to its potential for achieving high compute density and energy efficiency. However, device-to-device variability and the challenges of porting existing designs to advanced process nodes have posed a major hindrance to harnessing the full potential of analog computation for Machine Learning (ML) applications. This work proposes a novel analog computing framework for designing an analog ML processor similar to a digital design, where the designs can be scaled and ported to advanced process nodes without architectural changes. At the core of our work lies shape-based analog computing (S-AC). It utilizes device primitives to yield a robust proto-function from which other non-linear shapes can be derived. The S-AC paradigm also allows the user to trade off computational precision against silicon circuit area and power, thus allowing users to build a truly power-efficient and scalable analog architecture in which the same synthesized analog circuit can operate across different transistor biasing regimes and simultaneously scale across process nodes. As a proof of concept, we show the implementation of commonly used mathematical functions for carrying out standard ML tasks in both a planar CMOS 180nm and a FinFET 7nm process node. The synthesized shape-based ML architecture has been demonstrated for its classification accuracy on standard datasets at different process nodes.
    Generating Annotated Training Data for 6D Object Pose Estimation in Operational Environments with Minimal User Interaction. (arXiv:2103.09696v3 [cs.RO] UPDATED)
    Recently developed deep neural networks achieved state-of-the-art results in the subject of 6D object pose estimation for robot manipulation. However, those supervised deep learning methods require expensive annotated training data. Current methods for reducing those costs frequently use synthetic data from simulations, but rely on expert knowledge and suffer from the "domain gap" when shifting to the real world. Here, we present a proof of concept for a novel approach of autonomously generating annotated training data for 6D object pose estimation. This approach is designed for learning new objects in operational environments while requiring little interaction and no expertise on the part of the user. We evaluate our autonomous data generation approach in two grasping experiments, where we achieve a grasping success rate similar to related work on a non-autonomously generated data set.
    Utilizing coarse-grained data in low-data settings for event extraction. (arXiv:2205.05468v1 [cs.CL])
    Annotating text data for event information extraction systems is hard, expensive, and error-prone. We investigate the feasibility of integrating coarse-grained data (document or sentence labels), which is far more feasible to obtain, instead of annotating more documents. We utilize a multi-task model with two auxiliary tasks, document and sentence binary classification, in addition to the main task of token classification. We perform a series of experiments with varying data regimes for the aforementioned integration. Results show that while introducing extra coarse-grained data offers greater improvement and robustness, a gain is still possible with only the addition of negative documents that have no information on any event.
    Towards An Efficient Approach for the Nonconvex $\ell_p$ Ball Projection: Algorithm and Analysis. (arXiv:2101.01350v6 [math.OC] UPDATED)
    This paper primarily focuses on computing the Euclidean projection of a vector onto the $\ell_{p}$ ball in which $p\in(0,1)$. Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote the sparsity of the desired solution. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem. Based on this characterization, we develop a novel numerical approach for computing the stationary point by solving a sequence of projections onto the reweighted $\ell_{1}$-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case $O(1/\sqrt{k})$ convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm.
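    A hedged sketch of the reweighted-$\ell_1$ idea (our reading; the paper derives its method from the first-order optimality conditions): linearizing $|x_i|^p$ around the current iterate $x^k$ gives $|x_i|^p \approx |x_i^k|^p + p|x_i^k|^{p-1}(|x_i| - |x_i^k|)$, so the constraint $\sum_i |x_i|^p \le r$ is replaced by the weighted $\ell_1$-ball constraint $\sum_i w_i^k |x_i| \le r_k$ with weights $w_i^k = p|x_i^k|^{p-1}$, and the next iterate $x^{k+1}$ is the Euclidean projection of the input vector onto this ball.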
    Towards Model Agnostic Federated Learning Using Knowledge Distillation. (arXiv:2110.15210v2 [cs.LG] UPDATED)
    Is it possible to design a universal API for federated learning with which an ad-hoc group of data-holders (agents) can collaborate and perform federated learning? Such an API would necessarily need to be model-agnostic, i.e., make no assumption about the model architecture being used by the agents, and it also cannot rely on having representative public data at hand. Knowledge distillation (KD) is the obvious tool of choice to design such protocols. However, surprisingly, we show that most natural KD-based federated learning protocols have poor performance. To investigate this, we propose a new theoretical framework, Federated Kernel ridge regression, which can capture both model heterogeneity and data heterogeneity. Our analysis shows that the degradation is largely due to a fundamental limitation of knowledge distillation under data heterogeneity. We further validate our framework by analyzing and designing new protocols based on KD. Their performance on real-world experiments using neural networks, though still unsatisfactory, closely matches our theoretical predictions.
    A Word is Worth A Thousand Dollars: Adversarial Attack on Tweets Fools Stock Prediction. (arXiv:2205.01094v2 [cs.CR] UPDATED)
    More and more investors and machine learning models rely on social media (e.g., Twitter and Reddit) to gather real-time information and sentiment to predict stock price movements. Although text-based models are known to be vulnerable to adversarial attacks, whether stock prediction models have a similar vulnerability is underexplored. In this paper, we experiment with a variety of adversarial attack configurations to fool three stock prediction victim models. We address the task of adversarial generation by solving combinatorial optimization problems with semantics and budget constraints. Our results show that the proposed attack method can achieve consistent success rates and cause significant monetary loss in trading simulation by simply concatenating a perturbed but semantically similar tweet.
    Salient Object Detection via Bounding-box Supervision. (arXiv:2205.05245v1 [cs.CV])
    The success of fully supervised saliency detection models depends on large amounts of pixel-wise labeling. In this paper, we work on bounding-box based weakly-supervised saliency detection to relieve the labeling effort. Given the bounding box annotation, we observe that pixels inside the bounding box may contain extensive labeling noise. However, as a large amount of background is excluded, the foreground bounding box region contains a less complex background, making it possible to perform handcrafted-features-based saliency detection with only the cropped foreground region. As the conventional handcrafted features are not representative enough, leading to noisy saliency maps, we further introduce a structure-aware self-supervised loss to regularize the structure of the prediction. Further, we claim that pixels outside the bounding box should be background, so a partial cross-entropy loss can be used to accurately localize the background region. Experimental results on six benchmark RGB saliency datasets illustrate the effectiveness of our model.
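    The partial cross-entropy idea is simple to sketch (our illustration; only pixels outside the box receive a definite background label, while pixels inside are handled by the handcrafted saliency and structure-aware loss described above):

        import torch
        import torch.nn.functional as F

        def partial_bce(pred, box_mask):
            # pred: (B,1,H,W) saliency logits; box_mask: (B,1,H,W), 1 inside the bbox
            outside = box_mask == 0
            target = torch.zeros_like(pred)         # outside-the-box pixels are background
            return F.binary_cross_entropy_with_logits(pred[outside], target[outside])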
    Social Inclusion in Curated Contexts: Insights from Museum Practices. (arXiv:2205.05192v1 [cs.LG])
    Artificial intelligence literature suggests that minority and fragile communities in society can be negatively impacted by machine learning algorithms due to inherent biases in the design process, which lead to socially exclusive decisions and policies. Faced with similar challenges in dealing with an increasingly diversified audience, the museum sector has seen changes in theory and practice, particularly in the areas of representation and meaning-making. While rarity and grandeur used to take centre stage in early museum practices, folk life and museums' relationships with the diverse communities they serve have become a widely integrated part of contemporary practice. These changes address issues of diversity and accessibility in order to offer more socially inclusive services. Drawing on these changes and reflecting back on the AI world, we argue that the museum experience provides useful lessons for building AI with socially inclusive approaches, especially in situations in which both a collection and access to it will need to be curated or filtered, as frequently happens in search engines, recommender systems and digital libraries. We highlight three principles: (1) Instead of upholding the value of neutrality, practitioners are aware of the influences of their own backgrounds and those of others on their work. By not claiming to be neutral but practising cultural humility, the chances of addressing potential biases can be increased. (2) There should be room for situational interpretation beyond the stages of data collection and machine learning. Before applying models and predictions, the contexts in which relevant parties exist should be taken into account. (3) Community participation serves the needs of communities and has the added benefit of bringing practitioners and communities together.
    Learning Multitask Gaussian Bayesian Networks. (arXiv:2205.05343v1 [stat.ML])
    Major depressive disorder (MDD) requires the study of alterations in brain functional connectivity, which can be uncovered by resting-state functional magnetic resonance imaging (rs-fMRI) data. We consider the problem of identifying alterations of brain functional connectivity for a single MDD patient. This is particularly difficult since the amount of data collected during an fMRI scan is too limited to provide sufficient information for individual analysis. Additionally, rs-fMRI data are usually incomplete, sparse, variable, high-dimensional and noisy. To address these problems, we propose a multitask Gaussian Bayesian network (MTGBN) framework capable of identifying individual disease-induced alterations in MDD patients. We assume that such disease-induced alterations show some degree of similarity across patients, which allows the network structures to be learned jointly from related tasks. First, we treat the observations of each patient within a data class as a task, and then learn the Gaussian Bayesian networks (GBNs) of this data class by learning from all tasks, which share a default covariance matrix encoding prior knowledge. This setting helps us learn more information from limited data. Next, we derive a closed-form formula for the complete likelihood function and use the Monte-Carlo Expectation-Maximization (MCEM) algorithm to efficiently search for approximately optimal Bayesian network structures. Finally, we assess the performance of our method on simulated and real-world rs-fMRI data.
    CNN-LSTM Based Multimodal MRI and Clinical Data Fusion for Predicting Functional Outcome in Stroke Patients. (arXiv:2205.05545v1 [eess.IV])
    Clinical outcome prediction plays an important role in stroke patient management. From a machine learning point of view, one of the main challenges is dealing with heterogeneous data at patient admission, i.e., multidimensional image data and scalar clinical data. In this paper, a multimodal convolutional neural network - long short-term memory (CNN-LSTM) based ensemble model is proposed. For each MR image modality, a dedicated network provides a preliminary prediction of the clinical outcome using the modified Rankin scale (mRS). The final mRS score is obtained by merging the preliminary probabilities of each modality-specific network, weighted by the clinical metadata, here age or the National Institutes of Health Stroke Scale (NIHSS). The experimental results demonstrate that the proposed model surpasses the baselines and offers an original way to automatically encode the spatio-temporal context of MR images in a deep learning architecture. The highest AUC (0.77) was achieved for the proposed model with NIHSS.
    On Distributed Adaptive Optimization with Gradient Compression. (arXiv:2205.05632v1 [stat.ML])
    We study COMP-AMS, a distributed optimization framework based on gradient averaging and the adaptive AMSGrad algorithm. Gradient compression with error feedback is applied to reduce the communication cost in the gradient transmission process. Our convergence analysis of COMP-AMS shows that such a compressed gradient averaging strategy yields the same convergence rate as standard AMSGrad, and also exhibits the linear speedup effect w.r.t. the number of local workers. Compared with recently proposed protocols on distributed adaptive methods, COMP-AMS is simple and convenient. Numerical experiments are conducted to justify the theoretical findings, and demonstrate that the proposed method can achieve the same test accuracy as full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods.
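    Gradient compression with error feedback is easy to sketch in isolation (our illustration with top-k sparsification; the exact compressor used in the paper may differ): each worker transmits only the largest entries and carries the residual forward so compression errors are corrected over subsequent rounds.

        import torch

        def compress_with_feedback(grad, error, k):
            corrected = (grad + error).flatten()      # add back the previous residual
            idx = corrected.abs().topk(k).indices
            sparse = torch.zeros_like(corrected)
            sparse[idx] = corrected[idx]              # transmitted message (top-k entries)
            new_error = corrected - sparse            # residual kept locally
            return sparse.view_as(grad), new_error.view_as(grad)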
    RvS: What is Essential for Offline RL via Supervised Learning?. (arXiv:2112.10751v2 [cs.LG] UPDATED)
    Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. These insights serve as a field guide for practitioners doing Reinforcement Learning via Supervised Learning (which we coin "RvS learning"). They also probe the limits of existing RvS methods, which are comparatively weak on random data, and suggest a number of open problems.
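    The essential RvS learner described here fits in a few lines (a sketch under our own assumptions: continuous actions trained by MSE behavior cloning, with the conditioning variable, e.g., a goal or reward, concatenated to the observation):

        import torch
        import torch.nn as nn

        class RvSPolicy(nn.Module):
            def __init__(self, obs_dim, cond_dim, act_dim, hidden=256):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(obs_dim + cond_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, act_dim))       # simple feedforward MLP head

            def forward(self, obs, cond):
                # condition on goals or rewards by simple concatenation
                return self.net(torch.cat([obs, cond], dim=-1))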
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v2 [cs.LG] UPDATED)
    Transparency and fairness issues stem from the black-box nature of deep neural networks (DNNs). They are relevant to Deep Reinforcement Learning, which also uses DNNs to learn its policy, value functions, etc. This paper proposes a way to circumvent these issues through the bottom-up design of neural networks (NNs) with detailed interpretability, where each neuron or layer has its own meaning and utility corresponding to a humanly understandable concept. With deliberate design, we show that lavaland problems can be solved using an NN model with few parameters. Furthermore, we introduce the Self Reward Design (SRD), inspired by the Inverse Reward Design, so that our interpretable design can (1) solve the problem by pure design (although imperfectly), (2) be optimized via SRD, and (3) avoid unknown states by recognizing the inactivations of neurons aggregated as the activation in \(w_{unknown}\).
    RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization. (arXiv:2205.05671v1 [cs.CV])
    This paper explores training efficient VGG-style super-resolution (SR) networks with the structural re-parameterization technique. The general pipeline of re-parameterization is to train networks with a multi-branch topology first, and then merge them into standard 3x3 convolutions for efficient inference. In this work, we revisit those primary designs and investigate essential components for re-parameterizing SR networks. First of all, we find that batch normalization (BN) is important for bringing non-linearity into training and improving the final performance. However, BN is typically ignored in SR, as it usually degrades the performance and introduces unpleasant artifacts. We carefully analyze the cause of the BN issue and then propose a straightforward yet effective solution. In particular, we first train SR networks with mini-batch statistics as usual, and then switch to using population statistics in the later training period. Having successfully re-introduced BN into SR, we further design a new re-parameterizable block tailored for SR, namely RepSR. It consists of a clean residual path and two expand-and-squeeze convolution paths with the modified BN. Extensive experiments demonstrate that our simple RepSR is capable of achieving superior performance to previous SR re-parameterization methods across different model sizes. In addition, our RepSR can achieve a better trade-off between performance and actual running time (throughput) than previous SR methods. Codes will be available at https://github.com/TencentARC/RepSR.
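    The merging step rests on the standard conv+BN fusion, which folds the batch norm's affine transform into the convolution at inference time (a generic sketch, not the RepSR code; the full merge of multi-branch blocks reduces to sums of such fused convolutions):

        import torch

        def fuse_conv_bn(conv, bn):
            # y = gamma*(conv(x) - mean)/sqrt(var + eps) + beta  folds into one conv
            scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
            fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels,
                                    conv.kernel_size, conv.stride, conv.padding, bias=True)
            fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
            bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
            fused.bias.data = (bias - bn.running_mean) * scale + bn.bias
            return fused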
    Post-Hoc Explanations Fail to Achieve their Purpose in Adversarial Contexts. (arXiv:2201.10295v2 [cs.LG] UPDATED)
    Existing and planned legislation stipulates various obligations to provide information about machine learning algorithms and their functioning, often interpreted as obligations to "explain". Many researchers suggest using post-hoc explanation algorithms for this purpose. In this paper, we combine legal, philosophical and technical arguments to show that post-hoc explanation algorithms are unsuitable to achieve the law's objectives. Indeed, most situations where explanations are requested are adversarial, meaning that the explanation provider and receiver have opposing interests and incentives, so that the provider might manipulate the explanation for her own ends. We show that this fundamental conflict cannot be resolved because of the high degree of ambiguity of post-hoc explanations in realistic application scenarios. As a consequence, post-hoc explanation algorithms are unsuitable to achieve the transparency objectives inherent to the legal norms. Instead, there is a need to more explicitly discuss the objectives underlying "explainability" obligations as these can often be better achieved through other mechanisms. There is an urgent need for a more open and honest discussion regarding the potential and limitations of post-hoc explanations in adversarial contexts, in particular in light of the current negotiations of the European Union's draft Artificial Intelligence Act.
    Hierarchical Collaborative Hyper-parameter Tuning. (arXiv:2205.05272v1 [cs.LG])
    Hyper-parameter tuning is among the most critical stages in building machine learning solutions. This paper demonstrates how multi-agent systems can be utilized to develop a distributed technique for determining near-optimal values for any arbitrary set of hyper-parameters in a machine learning model. The proposed method employs a distributedly formed hierarchical agent-based architecture for the cooperative search procedure of tuning hyper-parameter values. The presented generic model is used to develop a guided randomized agent-based tuning technique, and its behavior is investigated in both machine learning and global function optimization applications. According to the empirical results, the proposed model outperformed both of its underlying randomized tuning strategies in terms of classification error and function evaluations, notably in higher numbers of dimensions.
    Efficient Automated Deep Learning for Time Series Forecasting. (arXiv:2205.05511v1 [cs.LG])
    Recent years have witnessed tremendously improved efficiency of Automated Machine Learning (AutoML), especially Automated Deep Learning (AutoDL) systems, but recent work focuses on tabular, image, or NLP tasks. So far, little attention has been paid to general AutoDL frameworks for time series forecasting, despite the enormous success in applying different novel architectures to such tasks. In this paper, we propose an efficient approach for the joint optimization of neural architecture and hyperparameters of the entire data processing pipeline for time series forecasting. In contrast to common NAS search spaces, we designed a novel neural architecture search space covering various state-of-the-art architectures, allowing for an efficient macro-search over different DL approaches. To efficiently search in such a large configuration space, we use Bayesian optimization with multi-fidelity optimization. We empirically study several different budget types enabling efficient multi-fidelity optimization on different forecasting datasets. Furthermore, we compared our resulting system, dubbed Auto-PyTorch-TS, against several established baselines and show that it significantly outperforms all of them across several datasets.
    DoubleMatch: Improving Semi-Supervised Learning with Self-Supervision. (arXiv:2205.05575v1 [cs.LG])
    Following the success of supervised learning, semi-supervised learning (SSL) is now becoming increasingly popular. SSL is a family of methods which, in addition to a labeled training set, also use a sizable collection of unlabeled data for fitting a model. Most of the recent successful SSL methods are based on pseudo-labeling approaches: letting confident model predictions act as training labels. While these methods have shown impressive results on many benchmark datasets, a drawback of this approach is that not all unlabeled data are used during training. We propose a new SSL algorithm, DoubleMatch, which combines the pseudo-labeling technique with a self-supervised loss, enabling the model to utilize all unlabeled data in the training process. We show that this method achieves state-of-the-art accuracies on multiple benchmark datasets while also reducing training times compared to existing SSL methods. Code is available at https://github.com/walline/doublematch.
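    As a rough illustration of how a pseudo-label objective can be paired with a self-supervised term that touches every unlabeled sample, consider the hedged PyTorch sketch below; the particular losses and threshold are illustrative, not DoubleMatch's exact formulation.
        import torch.nn.functional as F

        def unlabeled_loss(logits_weak, logits_strong, feat_weak, feat_strong,
                           threshold=0.95, w_self=1.0):
            conf, pseudo = logits_weak.softmax(dim=-1).max(dim=-1)
            mask = (conf >= threshold).float()
            # Pseudo-label term: only confident predictions act as labels.
            pseudo_loss = (mask * F.cross_entropy(logits_strong, pseudo,
                                                  reduction="none")).mean()
            # Self-supervised term: all unlabeled samples contribute, here via
            # negative cosine similarity between two augmented views.
            self_loss = -F.cosine_similarity(feat_weak.detach(), feat_strong,
                                             dim=-1).mean()
            return pseudo_loss + w_self * self_loss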
    Deep fusion of gray level co-occurrence matrices for lung nodule classification. (arXiv:2205.05123v1 [eess.IV])
    Lung cancer is a severe menace to human health; millions of people die because the cancer is diagnosed too late, so it is vital to detect the disease as early as possible. Computerized Tomography (CT) scanning of the chest is considered one of the efficient solutions for detecting and classifying lung nodules, and the need for high accuracy in analyzing CT images of the lung is one of the crucial challenges in detecting and classifying lung cancer. A new long short-term memory (LSTM) based deep fusion structure is introduced, in which texture features computed from lung nodules through new volumetric grey-level co-occurrence matrix (GLCM) computations are applied to classify the nodules into benign, malignant and ambiguous. An improved Otsu segmentation method combined with the water strider optimization algorithm (WSA) is proposed to detect the lung nodules; this Otsu-WSA thresholding overcomes restrictions present in previous thresholding methods. Extended experiments assess this fusion structure by considering 2D-GLCM computations based on 2D-slice fusion, and an approximation of the full 3D-GLCM with volumetric 2.5D-GLCM computations based on the LSTM fusion structure. The proposed methods are trained and assessed on the LIDC-IDRI dataset, obtaining accuracy, sensitivity and specificity of 94.4%, 91.6% and 95.8% for 2D-GLCM fusion; 97.33%, 96% and 98% for 2.5D-GLCM fusion; and 98.7%, 98% and 99% for 3D-GLCM fusion. The results and analysis indicate that the WSA-Otsu method requires less execution time and yields a more accurate thresholding process, and that the 3D-GLCM based LSTM outperforms its counterparts.
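    The texture features at the heart of this pipeline are built from co-occurrence counts. A minimal numpy sketch of a single-offset 2D GLCM, the building block behind the 2D/2.5D/3D variants above, follows; the quantization level and offset are illustrative choices.
        import numpy as np

        def glcm_2d(img, levels=8, dx=1, dy=0):
            # img: 2D array of integer grey levels in [0, levels).
            glcm = np.zeros((levels, levels), dtype=np.int64)
            h, w = img.shape
            for y in range(h - dy):
                for x in range(w - dx):
                    glcm[img[y, x], img[y + dy, x + dx]] += 1
            return glcm / glcm.sum()  # normalize to co-occurrence probabilities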
    Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection. (arXiv:2205.05206v1 [eess.AS])
    Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training we improve the ASD classification accuracy by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
    Quantum Self-Attention Neural Networks for Text Classification. (arXiv:2205.05625v1 [quant-ph])
    An emerging direction of quantum computing is to establish meaningful quantum applications in various fields of artificial intelligence, including natural language processing (NLP). Although some efforts based on syntactic analysis have opened the door to research in Quantum NLP (QNLP), limitations such as heavy syntactic preprocessing and syntax-dependent network architecture make them impracticable on larger and real-world data sets. In this paper, we propose a new simple network architecture, called the quantum self-attention neural network (QSANN), which can make up for these limitations. Specifically, we introduce the self-attention mechanism into quantum neural networks and then utilize a Gaussian projected quantum self-attention serving as a sensible quantum version of self-attention. As a result, QSANN is effective and scalable on larger data sets and has the desirable property of being implementable on near-term quantum devices. In particular, our QSANN outperforms the best existing QNLP model based on syntactic analysis as well as a simple classical self-attention neural network in numerical experiments of text classification tasks on public data sets. We further show that our method exhibits robustness to low-level quantum noises.
    Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis. (arXiv:2205.05662v1 [cs.LG])
    Advanced deep neural networks (DNNs), designed by either human or AutoML algorithms, are growing increasingly complex. Diverse operations are connected by complicated connectivity patterns, e.g., various types of skip connections. Those topological compositions are empirically effective and observed to smooth the loss landscape and facilitate the gradient flow in general. However, it remains elusive to derive any principled understanding of their effects on the DNN capacity or trainability, and to understand why or in which aspect one specific connectivity pattern is better than another. In this work, we theoretically characterize the impact of connectivity patterns on the convergence of DNNs under gradient descent training in fine granularity. By analyzing a wide network's Neural Network Gaussian Process (NNGP), we are able to depict how the spectrum of an NNGP kernel propagates through a particular connectivity pattern, and how that affects the bound of convergence rates. As one practical implication of our results, we show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate, and significantly accelerate the large-scale neural architecture search without any overhead. Codes will be released at https://github.com/chenwydj/architecture_convergence.
    Human Language Modeling. (arXiv:2205.05128v1 [cs.CL])
    Natural language is generated by people, yet traditional language modeling views words or documents as if generated independently. Here, we propose human language modeling (HuLM), a hierarchical extension to the language modeling problem whereby a human level exists to connect sequences of documents (e.g. social media messages) and capture the notion that human language is moderated by changing human states. We introduce HaRT, a large-scale transformer model for the HuLM task, pre-trained on approximately 100,000 social media users, and demonstrate its effectiveness in terms of both language modeling (perplexity) for social media and fine-tuning for 4 downstream tasks spanning document- and user-levels: stance detection, sentiment classification, age estimation, and personality assessment. Results on all tasks meet or surpass the current state-of-the-art.
    DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices for Full-Band Audio. (arXiv:2205.05474v1 [eess.AS])
    Deep learning-based speech enhancement has seen huge improvements and has recently expanded to full-band audio (48 kHz). However, many approaches have a rather high computational complexity and require large temporal buffers for real-time usage, e.g. due to temporal convolutions or attention. Both make those approaches infeasible on embedded devices. This work further extends DeepFilterNet, which exploits the harmonic structure of speech to allow for efficient speech enhancement (SE). Several optimizations in the training procedure, data augmentation, and network structure result in state-of-the-art SE performance while reducing the real-time factor to 0.04 on a notebook Core-i5 CPU, making the algorithm applicable to run on embedded devices in real time. The DeepFilterNet framework can be obtained under an open source license.
    The Multiscale Structure of Neural Network Loss Functions: The Effect on Optimization and Origin. (arXiv:2204.11326v2 [cs.LG] UPDATED)
    Local quadratic approximation has been extensively used to study the optimization of neural network loss functions around the minimum. However, it usually holds only in a very small neighborhood of the minimum and cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implication on optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss shows several separate scales clearly. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [4] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay with simple examples. Finally, we study the origin of the multiscale structure and propose that the non-uniformity of training data is one of its causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth or multiple separate scales.
    RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation. (arXiv:2205.05678v1 [cs.CV])
    This work considers identifying parameters characterizing a physical system's dynamic motion directly from a video whose rendering configurations are inaccessible. Existing solutions require massive training data or lack generalizability to unknown rendering configurations. We propose a novel approach that marries domain randomization and differentiable rendering gradients to address this problem. Our core idea is to train a rendering-invariant state-prediction (RISP) network that transforms image differences into state differences independent of rendering configurations, e.g., lighting, shadows, or material reflectance. To train this predictor, we formulate a new loss on rendering variances using gradients from differentiable rendering. Moreover, we present an efficient, second-order method to compute the gradients of this loss, allowing it to be integrated seamlessly into modern deep learning frameworks. We evaluate our method in rigid-body and deformable-body simulation environments using four tasks: state estimation, system identification, imitation learning, and visuomotor control. We further demonstrate the efficacy of our approach on a real-world example: inferring the state and action sequences of a quadrotor from a video of its motion sequences. Compared with existing methods, our approach achieves significantly lower reconstruction errors and has better generalizability among unknown rendering configurations.
    Spatial-temporal associations representation and application for process monitoring using graph convolution neural network. (arXiv:2205.05250v1 [cs.LG])
    Industrial process data reflects dynamic changes in operating conditions, which mainly manifest as irregular changes in the dynamic associations between different variables at different times. This association knowledge is often only implicit in the dynamic monitoring data, which carry rich operating-condition information that has not received enough attention in current research on process monitoring. To this end, a new process monitoring method based on a spatial-based graph convolution neural network (SGCN) is proposed to describe the characteristics of these dynamic associations, which can be used to represent the operating status over time. Spatial-temporal graphs are first defined, which can represent node attributes (dynamic edge features) that change with time. Then, the associations between monitoring variables at a certain time are taken as node attributes defining a snapshot of the static graph network at that time. Finally, the snapshot, containing graph structure and node attributes, is used as model input and processed for graph classification by a spatial-based graph convolution neural network with aggregate and readout steps. The feasibility and applicability of the proposed method are demonstrated by experimental results on a benchmark and a practical case application.
    Reducing a complex two-sided smartwatch examination for Parkinson's Disease to an efficient one-sided examination preserving machine learning accuracy. (arXiv:2205.05361v1 [cs.LG])
    Sensors from smart consumer devices have demonstrated high potential to serve as digital biomarkers for the identification of movement disorders in recent years. Using broadly available smartwatches, we have recorded participants performing technology-based assessments in a prospective study on Parkinson's Disease (PD). In total, 504 participants, including PD patients, differential diagnoses (DD) and healthy controls (HC), were captured with a comprehensive system utilizing two smartwatches and two smartphones. To the best of our knowledge, this study provides the largest PD sample size of two-hand synchronous smartwatch measurements. To establish a future easy-to-use home-based assessment system for PD screening, we systematically evaluated the performance of the system based on a significantly reduced set of assessments with only one-sided measures, and assessed whether classification accuracy could be maintained.
    Making Pre-trained Language Models Good Long-tailed Learners. (arXiv:2205.05461v1 [cs.CL])
    Prompt-tuning has shown appealing performance in few-shot classification by virtue of its capability to effectively exploit pre-trained knowledge. This motivates us to check the hypothesis that prompt-tuning is also a promising choice for long-tailed classification, since the tail classes are intuitively few-shot ones. We conduct empirical studies to examine this hypothesis, and the results demonstrate that prompt-tuning does make pre-trained language models at least good long-tailed learners. For intuition on why prompt-tuning achieves good performance in long-tailed classification, we carry out an in-depth analysis by progressively bridging the gap between prompt-tuning and commonly used fine-tuning. In summary, the classifier structure and parameterization form the key to making good long-tailed learners, while the input structure is less important. Finally, we verify the applicability of our finding to few-shot classification.
    Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. (arXiv:2205.05638v1 [cs.LG])
    Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new parameter-efficient fine-tuning method called (IA)$^3$ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available.
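    The abstract's one-line description of (IA)$^3$, scaling activations by learned vectors, admits a compact sketch. The PyTorch fragment below assumes the common placement on attention keys, values, and feed-forward activations; the module layout and names are illustrative, and only the vectors are trained.
        import torch
        import torch.nn as nn

        class IA3Scalers(nn.Module):
            def __init__(self, d_k, d_v, d_ff):
                super().__init__()
                # Initialized to ones so the pretrained model is unchanged at first.
                self.l_k = nn.Parameter(torch.ones(d_k))
                self.l_v = nn.Parameter(torch.ones(d_v))
                self.l_ff = nn.Parameter(torch.ones(d_ff))

            def forward(self, keys, values, ff_hidden):
                # Element-wise rescaling; broadcasts over batch and sequence dims.
                return keys * self.l_k, values * self.l_v, ff_hidden * self.l_ff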
    Choice of training label matters: how to best use deep learning for quantitative MRI parameter estimation. (arXiv:2205.05587v1 [physics.med-ph])
    Deep learning (DL) is gaining popularity as a parameter estimation method for quantitative MRI. A range of competing implementations have been proposed, relying on either supervised or self-supervised learning. Self-supervised approaches, sometimes referred to as unsupervised, have been loosely based on auto-encoders, whereas supervised methods have, to date, been trained on groundtruth labels. These two learning paradigms have been shown to have distinct strengths. Notably, self-supervised approaches have offered lower-bias parameter estimates than their supervised alternatives. This result is counterintuitive - incorporating prior knowledge with supervised labels should, in theory, lead to improved accuracy. In this work, we show that this apparent limitation of supervised approaches stems from the naive choice of groundtruth training labels. By training on labels which are deliberately not groundtruth, we show that the low-bias parameter estimation previously associated with self-supervised methods can be replicated - and improved on - within a supervised learning framework. This approach sets the stage for a single, unifying, deep learning parameter estimation framework, based on supervised learning, where trade-offs between bias and variance are made by careful adjustment of the training labels.
    Benchmarking Graph Neural Networks. (arXiv:2003.00982v4 [cs.LG] UPDATED)
    In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of May 2022, the GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest in exploring more powerful PE for Transformers and GNNs in a robust experimental setting.
    CVTT: Cross-Validation Through Time. (arXiv:2205.05393v1 [cs.LG])
    The practical aspects of evaluating recommender systems are an actively discussed topic in the research community. While many current evaluation techniques bring performance down to a single-value metric as a straightforward approach for model comparison, this rests on a strong assumption that a method's performance is stable over time. In this paper, we argue that leaving out a method's continuous performance can lead to losing valuable insight into joint data-method effects. We propose the Cross-Validation Through Time (CVTT) technique to perform more detailed evaluations, which focus on model cross-validation performance over time. Using the proposed technique, we conduct a detailed analysis of popular RecSys algorithms' performance against various metrics and datasets. We also compare several data preparation and evaluation strategies to analyze their impact on model performance. Our results show that model performance can vary significantly over time, and both data and evaluation setup can have a marked effect on it.
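    A hedged sketch of the underlying protocol, evaluating over rolling time windows instead of a single split, is given below; the fit/score interfaces and window sizes are hypothetical placeholders, not the paper's API.
        import pandas as pd

        def rolling_eval(df, model, window="30D", horizon="7D"):
            # df: interaction log with a 'timestamp' column, sorted by time.
            scores, end = [], df.timestamp.max()
            t = df.timestamp.min() + pd.Timedelta(window)
            while t + pd.Timedelta(horizon) <= end:
                train = df[df.timestamp < t]
                test = df[(df.timestamp >= t) & (df.timestamp < t + pd.Timedelta(horizon))]
                model.fit(train)
                scores.append((t, model.score(test)))  # e.g. NDCG@k per window
                t += pd.Timedelta(horizon)
            return scores  # a performance curve over time, not a single number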
    End-to-End Multi-Person Audio/Visual Automatic Speech Recognition. (arXiv:2205.05586v1 [eess.AS])
    Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face while still showing benefits of employing the visual signal instead of the audio alone.
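    The soft-selection idea reduces to attention weights over the competing tracks, queried by the audio representation. A hedged PyTorch sketch, with illustrative shapes and none of the paper's actual encoder details:
        import torch
        import torch.nn.functional as F

        def soft_select_tracks(audio_feat, track_feats):
            # audio_feat: (B, D); track_feats: (B, n_tracks, D)
            logits = torch.einsum("bd,bnd->bn", audio_feat, track_feats)
            weights = F.softmax(logits, dim=-1)       # one weight per face track
            fused = torch.einsum("bn,bnd->bd", weights, track_feats)
            return fused, weights  # fused video feature for the ASR encoder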
    Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery. (arXiv:2106.02190v6 [cs.LG] UPDATED)
    We developed Distilled Graph Attention Policy Network (DGAPN), a reinforcement learning model to generate novel graph-structured chemical representations that optimize user-defined objectives by efficiently navigating a physically constrained domain. The framework is examined on the task of generating molecules that are designed to bind, noncovalently, to functional sites of SARS-CoV-2 proteins. We present a spatial Graph Attention (sGAT) mechanism that leverages self-attention over both node and edge attributes as well as encoding the spatial structure -- this capability is of considerable interest in synthetic biology and drug discovery. An attentional policy network is introduced to learn the decision rules for a dynamic, fragment-based chemical environment, and state-of-the-art policy gradient techniques are employed to train the network with stability. Exploration is driven by the stochasticity of the action space design and the innovation reward bonuses learned and proposed by random network distillation. In experiments, our framework achieved outstanding results compared to state-of-the-art algorithms, while reducing the complexity of paths to chemical synthesis.
    Learning Spatiotemporal Chaos Using Next-Generation Reservoir Computing. (arXiv:2203.13294v2 [cs.LG] UPDATED)
    Forecasting the behavior of high-dimensional dynamical systems using machine learning requires efficient methods to learn the underlying physical model. We demonstrate spatiotemporal chaos prediction using a machine learning architecture that, when combined with a next-generation reservoir computer, displays state-of-the-art performance with a training time $10^3-10^4$ times faster and training data set $\sim 10^2$ times smaller than other machine learning algorithms. We also take advantage of the translational symmetry of the model to further reduce the computational cost and training data, each by a factor of $\sim$10.
    Performance of a deep learning system for detection of referable diabetic retinopathy in real clinical settings. (arXiv:2205.05554v1 [eess.IV])
    Background: To determine the ability of a commercially available deep learning system, RetCAD v.1.3.1 (Thirona, Nijmegen, The Netherlands), for the automatic detection of referable diabetic retinopathy (DR) on a dataset of colour fundus images acquired during routine clinical practice in a tertiary hospital screening program, and to analyze the workload reduction achievable by incorporating this artificial intelligence-based technology. Methods: Evaluation of the software was performed on a dataset of 7195 nonmydriatic fundus images from 6325 eyes of 3189 diabetic patients attending our screening program between February and December 2019. The software generated a DR severity score for each colour fundus image, which was combined into an eye-level score. This score was then compared with a reference standard set by a human expert using receiver operating characteristic (ROC) curve analysis. Results: The artificial intelligence (AI) software achieved an area under the ROC curve (AUC) value of 0.988 [0.981:0.993] for the detection of referable DR. At the proposed operating point, the sensitivity of the RetCAD software for DR is 90.53% and the specificity is 97.13%. A workload reduction of 96% could be achieved at the cost of only 6 false negatives. Conclusions: The AI software correctly identified the vast majority of referable DR cases, with a workload reduction of 96% of the cases that would need to be checked, while missing almost no true cases, so it may be used as an instrument for triage.
    Deep Graph Clustering via Mutual Information Maximization and Mixture Model. (arXiv:2205.05168v1 [cs.LG])
    Attributed graph clustering, or community detection, which learns to cluster the nodes of a graph, is a challenging task in graph analysis. In this paper, we introduce a contrastive learning framework for learning clustering-friendly node embedding. Although graph contrastive learning has shown outstanding performance in self-supervised graph learning, using it for graph clustering is not well explored. We propose Gaussian mixture information maximization (GMIM) which utilizes a mutual information maximization approach for node embedding. Meanwhile, it assumes that the representation space follows a Mixture of Gaussians (MoG) distribution. The clustering part of our objective tries to fit a Gaussian distribution to each community. The node embedding is jointly optimized with the parameters of MoG in a unified framework. Experiments on real-world datasets demonstrate the effectiveness of our method in community detection.
    Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise. (arXiv:2102.04297v4 [cs.LG] UPDATED)
    The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in \c{S}im\c{s}ekli (2019a,b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noise. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real-data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds "flatter" local minima and achieves better generalization.
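    The analyzed update is, in one common form, ordinary SGD with the gradient norm truncated at a fixed threshold, as in this minimal PyTorch sketch (per-coordinate truncation is an equally plausible reading):
        import torch

        def truncated_sgd_step(params, lr=0.01, threshold=1.0):
            with torch.no_grad():
                total = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
                scale = (threshold / (total + 1e-12)).clamp(max=1.0)
                for p in params:
                    p -= lr * scale * p.grad  # heavy gradient tails are clipped away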
    A simple framework for contrastive learning phases of matter. (arXiv:2205.05607v1 [cond-mat.dis-nn])
    A main task in condensed-matter physics is to recognize, classify, and characterize phases of matter and the corresponding phase transitions, for which machine learning provides a new class of research tools due to the remarkable development in computing power and algorithms. Despite much exploration in this new field, usually different methods and techniques are needed for different scenarios. Here, we present SimCLP: a simple framework for contrastive learning phases of matter, which is inspired by the recent development in contrastive learning of visual representations. We demonstrate the success of this framework on several representative systems, including classical and quantum, single-particle and many-body, conventional and topological. SimCLP is flexible and free of usual burdens such as manual feature engineering and prior knowledge. The only prerequisite is to prepare enough state configurations. Furthermore, it can generate representation vectors and labels and hence help tackle other problems. SimCLP therefore paves an alternative way to the development of a generic tool for identifying unexplored phase transitions.
    A Model-Free Sampling Method for Estimating Basins of Attraction Using Hybrid Active Learning (HAL). (arXiv:2003.10976v3 [cs.LG] UPDATED)
    Understanding the basins of attraction (BoA) is often a paramount consideration for nonlinear systems. Most existing approaches to determining a high-resolution BoA require prior knowledge of the system's dynamical model (e.g., differential equation or point mapping for continuous systems, cell mapping for discrete systems, etc.), which allows derivation of approximate analytical solutions or parallel computing on a multi-core computer to find the BoA efficiently. However, these methods are typically impractical when the BoA must be determined experimentally or when the system's model is unknown. This paper introduces a model-free sampling method for BoA. The proposed method is based upon hybrid active learning (HAL) and is designed to find and label the "informative" samples, which efficiently determine the boundary of BoA. It consists of three primary parts: 1) additional sampling on trajectories (AST) to maximize the number of samples obtained from each simulation or experiment; 2) an active learning (AL) algorithm to exploit the local boundary of BoA; and 3) a density-based sampling (DBS) method to explore the global boundary of BoA. An example of estimating the BoA for a bistable nonlinear system is presented to show the high efficiency of our HAL sampling method.
    Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits. (arXiv:2112.06423v2 [cs.LG] UPDATED)
    We consider the upper confidence bound strategy for Gaussian multi-armed bandits with known control horizon size $N$ and build its limiting description with a system of stochastic differential equations and ordinary differential equations. Rewards for the arms are assumed to have unknown expected values and known variances. To verify the validity of the obtained description, a set of Monte-Carlo simulations was performed for the case of close reward distributions, where mean rewards differ by a magnitude of order $N^{-1/2}$, as this yields the highest normalized regret. The minimal control horizon size for which the normalized regret is not noticeably larger than the maximum possible was also estimated.
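    For concreteness, a standard UCB index for Gaussian arms with known variances looks as follows; the exact confidence width used in the paper may differ, so treat this as a generic sketch.
        import math
        import numpy as np

        def ucb_choice(means, counts, sigmas, t):
            # means, counts: empirical means and pull counts per arm (play each
            # arm once first so counts > 0); sigmas: known standard deviations.
            index = means + sigmas * np.sqrt(2.0 * math.log(t) / counts)
            return int(np.argmax(index))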
    Sibylvariant Transformations for Robust Text Classification. (arXiv:2205.05137v1 [cs.CL])
    The vast majority of text transformation techniques in NLP are inherently limited in their ability to expand input space coverage due to an implicit constraint to preserve the original class label. In this work, we propose the notion of sibylvariance (SIB) to describe the broader set of transforms that relax the label-preserving constraint, knowably vary the expected class, and lead to significantly more diverse input distributions. We offer a unified framework to organize all data transformations, including two types of SIB: (1) Transmutations convert one discrete kind into another, (2) Mixture Mutations blend two or more classes together. To explore the role of sibylvariance within NLP, we implemented 41 text transformations, including several novel techniques like Concept2Sentence and SentMix. Sibylvariance also enables a unique form of adaptive training that generates new input mixtures for the most confused class pairs, challenging the learner to differentiate with greater nuance. Our experiments on six benchmark datasets strongly support the efficacy of sibylvariance for generalization performance, defect detection, and adversarial robustness.
    DNA data storage, sequencing data-carrying DNA. (arXiv:2205.05488v1 [cs.ET])
    DNA is a leading candidate as the next archival storage medium due to its density, durability and sustainability. To read (and write) data, DNA storage exploits technology that has been developed over decades to sequence naturally occurring DNA in the life sciences. To achieve higher accuracy for previously unseen, biological DNA, sequencing relies on extending and training deep machine learning models known as basecallers. This growth in model complexity requires substantial resources, both computational and in data sets. It also eliminates the possibility of a compact read head for DNA as a storage medium. We argue that we need to depart from blindly using sequencing models from the life sciences for DNA data storage. The difference is striking: for life science applications we have no control over the DNA, whereas in the case of DNA data storage we control how it is written, as well as the particular write head. More specifically, data-carrying DNA can be modulated and embedded with alignment markers and error correcting codes to guarantee higher fidelity and to carry out some of the work that the machine learning models perform. In this paper, we study accuracy trade-offs between deep model size and error correcting codes. We show that, starting with a model size of 107MB, the reduced accuracy from model compression can be compensated by using simple error correcting codes in the DNA sequences. In our experiments, we show that a substantial reduction in the size of the model does not incur an undue penalty for the error correcting codes used, therefore paving the way for a portable read head for data-carrying DNA. Crucially, we show that through the joint use of model compression and error correcting codes, we achieve a higher read accuracy than without compression and error correction codes.
    A Unified f-divergence Framework Generalizing VAE and GAN. (arXiv:2205.05214v1 [stat.ML])
    Developing deep generative models that flexibly incorporate diverse measures of probability distance is an important area of research. Here we develop a unified mathematical framework of f-divergence generative models, f-GM, that incorporates both VAE and f-GAN and enables tractable learning with general f-divergences. f-GM allows the experimenter to flexibly design the f-divergence function without changing the structure of the networks or the learning procedure. f-GM jointly models three components: a generator, an inference network and a density estimator. It therefore simultaneously enables sampling, posterior inference of the latent variable, and evaluation of the likelihood of an arbitrary datum. f-GM belongs to the class of encoder-decoder GANs: our density estimator can be interpreted as playing the role of a discriminator between samples in the joint space of latent code and observed space. We prove that f-GM naturally simplifies to the standard VAE and to f-GAN as special cases, illustrating the connections between different encoder-decoder GAN architectures. f-GM is compatible with general network architectures and optimizers. We leverage it to experimentally explore the effects -- e.g. mode collapse and image sharpness -- of different choices of f-divergence.
    REIN-2: Giving Birth to Prepared Reinforcement Learning Agents Using Reinforcement Learning Agents. (arXiv:2110.05128v2 [cs.LG] UPDATED)
    Deep Reinforcement Learning (Deep RL) has been in the spotlight for the past few years, due to its remarkable abilities to solve problems which were considered to be practically unsolvable using traditional Machine Learning methods. However, even state-of-the-art Deep RL algorithms have various weaknesses that prevent them from being used extensively within industry applications, with one such major weakness being their sample-inefficiency. In an effort to patch these issues, we integrated a meta-learning technique to shift the objective from learning to solve a task to learning how to learn to solve a task (or a set of tasks), which we empirically show improves the overall stability and performance of Deep RL algorithms. Our model, named REIN-2, is a meta-learning scheme formulated within the RL framework, the goal of which is to develop a meta-RL agent (meta-learner) that learns how to produce other RL agents (inner-learners) that are capable of solving given environments. For this task, we convert the typical interaction of an RL agent with the environment into a new, single environment for the meta-learner to interact with. Compared to traditional state-of-the-art Deep RL algorithms, experimental results show remarkable performance of our model in popular OpenAI Gym environments in terms of scoring and sample efficiency, including the Mountain Car hard-exploration environment.
    Advanced sleep spindle identification with neural networks. (arXiv:2202.05158v2 [eess.SP] UPDATED)
    Sleep spindles are neurophysiological phenomena that appear to be linked to memory formation and other functions of the central nervous system, and that can be observed in electroencephalographic recordings (EEG) during sleep. Manually identified spindle annotations in EEG recordings suffer from substantial intra- and inter-rater variability, even if raters have been highly trained, which reduces the reliability of spindle measures as a research and diagnostic tool. The Massive Online Data Annotation (MODA) project has recently addressed this problem by forming a consensus from multiple such rating experts, thus providing a corpus of spindle annotations of enhanced quality. Based on this dataset, we present a U-Net-type deep neural network model to automatically detect sleep spindles. Our model's performance exceeds that of the state-of-the-art detector and of most experts in the MODA dataset. We observed improved detection accuracy in subjects of all ages, including older individuals whose spindles are particularly challenging to detect reliably. Our results underline the potential of automated methods to perform repetitive, cumbersome tasks with super-human performance.
    Extracting Latent Steering Vectors from Pretrained Language Models. (arXiv:2205.05124v1 [cs.CL])
    Prior work on controllable text generation has focused on learning how to control language models through trainable decoding, smart-prompt design, or fine-tuning based on a desired objective. We hypothesize that the information needed to steer the model to generate a target sentence is already encoded within the model. Accordingly, we explore a different approach altogether: extracting latent vectors directly from pretrained language model decoders without fine-tuning. Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly (> 99 BLEU) for English sentences from a variety of domains. We show that vector arithmetic can be used for unsupervised sentiment transfer on the Yelp sentiment benchmark, with performance comparable to models tailored to this task. We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark (STS-B), outperforming pooled hidden states of models. Finally, we present an analysis of the intrinsic properties of the steering vectors. Taken together, our results suggest that frozen LMs can be effectively controlled through their latent steering space.
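    Applying an already-extracted steering vector is mechanically simple; a hedged PyTorch sketch using a forward hook on a decoder layer is shown below. The layer choice and the extraction procedure itself are the paper's contribution; the names here are illustrative.
        import torch

        def add_steering_hook(decoder_layer, steer_vec):
            # decoder_layer: module whose output (or output[0]) is hidden states
            # of shape (B, T, D); steer_vec: a (D,) vector extracted beforehand.
            def hook(module, inputs, output):
                if isinstance(output, tuple):
                    return (output[0] + steer_vec,) + output[1:]
                return output + steer_vec
            return decoder_layer.register_forward_hook(hook)  # .remove() to undo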
    Evaluation Gaps in Machine Learning Practice. (arXiv:2205.05256v1 [cs.LG])
    Forming a reliable judgement of a machine learning (ML) model's appropriateness for an application ecosystem is critical for its responsible use, and requires considering a broad range of factors including harms, benefits, and responsibilities. In practice, however, evaluations of ML models frequently focus on only a narrow range of decontextualized predictive behaviours. We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations. Through an empirical study of papers from recent high-profile conferences in the Computer Vision and Natural Language Processing communities, we demonstrate a general focus on a handful of evaluation methods. By considering the metrics and test data distributions used in these methods, we draw attention to which properties of models are centered in the field, revealing the properties that are frequently neglected or sidelined during evaluation. By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the limited role of model inputs in evaluation, and the equivalence of different failure modes. Shedding light on these assumptions enables us to question their appropriateness for ML system contexts, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.
    Internet of Behavior (IoB) and Explainable AI Systems for Influencing IoT Behavior. (arXiv:2109.07239v2 [cs.DC] UPDATED)
    Pandemics and natural disasters over the years have changed the behavior of people, which has had a tremendous impact on all aspects of life. With the technologies available in each era, governments, organizations, and companies have used them to track, control, and influence the behavior of individuals for their benefit. Nowadays, the use of the Internet of Things (IoT), cloud computing, and artificial intelligence (AI) has made it easier to track and change the behavior of users through changing IoT behavior. This article introduces and discusses the concept of the Internet of Behavior (IoB) and its integration with Explainable AI (XAI) techniques to provide a trusted and evident experience in the process of changing IoT behavior so as to ultimately improve users' behavior. A system based on IoB and XAI is proposed for a use case scenario of electrical power consumption, aiming to influence user consumption behavior in order to reduce power consumption and cost. The scenario results showed a decrease of 522.2 kW of active power when compared to original consumption over a 200-hour period, and a total power cost saving of 95.04 Euro for the same period. Moreover, decreasing the global active power will reduce power intensity, given the positive correlation between them.
    Is calibration a fairness requirement? An argument from the point of view of moral philosophy and decision theory. (arXiv:2205.05512v1 [cs.LG])
    In this paper, we provide a moral analysis of two criteria of statistical fairness debated in the machine learning literature: 1) calibration between groups and 2) equality of false positive and false negative rates between groups. In our paper, we focus on moral arguments in support of either measure. The conflict between group calibration vs. false positive and false negative rate equality is one of the core issues in the debate about group fairness definitions among practitioners. For any thorough moral analysis, the meaning of the term fairness has to be made explicit and defined properly. For our paper, we equate fairness with (non-)discrimination, which is a legitimate understanding in the discussion about group fairness. More specifically, we equate it with prima facie wrongful discrimination in the sense in which this is used in Lippert-Rasmussen's treatment of this definition. In this paper, we argue that a violation of group calibration may be unfair in some cases, but not unfair in others. This is in line with claims already advanced in the literature, that algorithmic fairness should be defined in a way that is sensitive to context. The most important practical implication is that arguments based on examples in which fairness requires between-group calibration, or equality in the false-positive/false-negative rates, do not generalize. For it may be that group calibration is a fairness requirement in one case, but not in another.
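    To make the first criterion concrete, group calibration can be checked by binning scores and comparing empirical positive rates across groups, as in this hedged numpy sketch (the bin count and group encoding are illustrative):
        import numpy as np

        def group_calibration(scores, labels, groups, bins=10):
            edges = np.linspace(0, 1, bins + 1)
            table = {}
            for g in np.unique(groups):
                m = groups == g
                idx = np.clip(np.digitize(scores[m], edges) - 1, 0, bins - 1)
                table[g] = [labels[m][idx == b].mean() if (idx == b).any() else None
                            for b in range(bins)]
            return table  # calibrated between groups if rows agree bin by bin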
    RLOP: RL Methods in Option Pricing from a Mathematical Perspective. (arXiv:2205.05600v1 [q-fin.PR])
    In this work, we build two environments, namely the modified QLBS and RLOP models, from a mathematical perspective, which enables RL methods in option pricing through portfolio replication. We implement the environment specifications (the source code can be found at https://github.com/owen8877/RLOP), the learning algorithm, and the agent parametrization by a neural network. The learned optimal hedging strategy is compared against the BS prediction. The effects of various factors on the optimal price and position are also considered and studied.
    Stochastic Variational Smoothed Model Checking. (arXiv:2205.05398v1 [cs.LG])
    Model-checking for parametric stochastic models can be expressed as checking the satisfaction probability of a certain property as a function of the parameters of the model. Smoothed model checking (smMC) leverages Gaussian Processes (GP) to infer the satisfaction function over the entire parameter space from a limited set of observations obtained via simulation. This approach provides accurate reconstructions with statistically sound quantification of the uncertainty. However, it inherits the scalability issues of GP. In this paper, we exploit recent advances in probabilistic machine learning to push past this limitation, making Bayesian inference of smMC scalable to larger datasets and enabling its application to models with a larger parameter set dimension. We propose Stochastic Variational Smoothed Model Checking (SV-smMC), a solution that exploits stochastic variational inference (SVI) to approximate the posterior distribution of the smMC problem. The strength and flexibility of SVI make SV-smMC applicable to two alternative probabilistic models: Gaussian Processes (GP) and Bayesian Neural Networks (BNN). Moreover, SVI makes inference easily parallelizable and enables GPU acceleration. In this paper, we compare the performance of smMC against that of SV-smMC in terms of scalability, computational efficiency, and the accuracy of the reconstructed satisfaction function.
    Federated Learning from Only Unlabeled Data with Class-Conditional-Sharing Clients. (arXiv:2204.03304v2 [cs.LG] UPDATED)
    Supervised federated learning (FL) enables multiple clients to share the trained model without sharing their labeled data. However, potential clients might even be reluctant to label their own data, which could limit the applicability of FL in practice. In this paper, we show the possibility of unsupervised FL whose model is still a classifier for predicting class labels, if the class-prior probabilities are shifted while the class-conditional distributions are shared among the unlabeled data owned by the clients. We propose federation of unsupervised learning (FedUL), where the unlabeled data are transformed into surrogate labeled data for each of the clients, a modified model is trained by supervised FL, and the wanted model is recovered from the modified model. FedUL is a very general solution to unsupervised FL: it is compatible with many supervised FL methods, and the recovery of the wanted model can be theoretically guaranteed as if the data have been labeled. Experiments on benchmark and real-world datasets demonstrate the effectiveness of FedUL. Code is available at https://github.com/lunanbit/FedUL.
    Analysis of three dimensional potential problems in non-homogeneous media with physics-informed deep collocation method using material transfer learning and sensitivity analysis. (arXiv:2010.12060v2 [cs.LG] UPDATED)
    In this work, we present a deep collocation method for three dimensional potential problems in nonhomogeneous media. This approach utilizes a physics-informed neural network with material transfer learning, reducing the solution of the nonhomogeneous partial differential equations to an optimization problem. We tested different configurations of the physics-informed neural network, including smooth activation functions, sampling methods for collocation point generation, and combined optimizers. A material transfer learning technique is utilised for nonhomogeneous media with different material gradations and parameters, which enhances the generality and robustness of the proposed method. In order to identify the most influential parameters of the network configuration, we carried out a global sensitivity analysis. Finally, we provide a convergence proof of our deep collocation method (DCM). The approach is validated through several benchmark problems, also testing different material variations.
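    The collocation idea itself is compact: penalize the PDE residual of a neural surrogate at sampled points. The PyTorch sketch below shows it for a toy Poisson problem in 3D, not the paper's nonhomogeneous setting with material gradation; the architecture and sampling are illustrative.
        import torch

        net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                                  torch.nn.Linear(64, 1))

        def collocation_loss(x, source):
            # x: (N, 3) collocation points; source: (N,) right-hand side f.
            x = x.clone().requires_grad_(True)
            u = net(x)
            (du,) = torch.autograd.grad(u.sum(), x, create_graph=True)
            lap = sum(torch.autograd.grad(du[:, i].sum(), x,
                                          create_graph=True)[0][:, i]
                      for i in range(3))
            return ((lap - source) ** 2).mean()  # residual of Laplacian(u) = f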
    Blockchain-based Secure Client Selection in Federated Learning. (arXiv:2205.05611v1 [cs.CR])
    Despite the great potential of Federated Learning (FL) in large-scale distributed learning, the current system is still subject to several privacy issues due to the fact that local models trained by clients are exposed to the central server. Consequently, secure aggregation protocols for FL have been developed to conceal the local models from the server. However, we show that, by manipulating the client selection process, the server can circumvent the secure aggregation to learn the local models of a victim client, indicating that secure aggregation alone is inadequate for privacy protection. To tackle this issue, we leverage blockchain technology to propose a verifiable client selection protocol. Owing to the immutability and transparency of blockchain, our proposed protocol enforces a random selection of clients, making the server unable to control the selection process at its discretion. We present security proofs showing that our protocol is secure against this attack. Additionally, we conduct several experiments on an Ethereum-like blockchain to demonstrate the feasibility and practicality of our solution.
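    The key primitive, a client draw that anyone can recompute and audit, can be sketched in a few lines by seeding a PRNG with a public block hash; the paper's full protocol involves considerably more machinery than this hypothetical fragment.
        import hashlib
        import random

        def select_clients(client_ids, block_hash, k):
            seed = int(hashlib.sha256(block_hash.encode()).hexdigest(), 16)
            rng = random.Random(seed)      # deterministic given the block hash
            return rng.sample(sorted(client_ids), k)  # auditable by any party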
    Characterizing the Action-Generalization Gap in Deep Q-Learning. (arXiv:2205.05588v1 [cs.AI])
    We study the action generalization ability of deep Q-learning in discrete action spaces. Generalization is crucial for efficient reinforcement learning (RL) because it allows agents to use knowledge learned from past experiences on new tasks. But while function approximation provides deep RL agents with a natural way to generalize over state inputs, the same generalization mechanism does not apply to discrete action outputs. And yet, surprisingly, our experiments indicate that Deep Q-Networks (DQN), which use exactly this type of function approximator, are still able to achieve modest action generalization. Our main contribution is twofold: first, we propose a method of evaluating action generalization using expert knowledge of action similarity, and empirically confirm that action generalization leads to faster learning; second, we characterize the action-generalization gap (the difference in learning performance between DQN and the expert) in different domains. We find that DQN can indeed generalize over actions in several simple domains, but that its ability to do so decreases as the action space grows larger.
    Weak Supervision with Incremental Source Accuracy Estimation. (arXiv:2205.05302v1 [cs.LG])
    Motivated by the desire to generate labels for real-time data we develop a method to estimate the dependency structure and accuracy of weak supervision sources incrementally. Our method first estimates the dependency structure associated with the supervision sources and then uses this to iteratively update the estimated source accuracies as new data is received. Using both off-the-shelf classification models trained using publicly-available datasets and heuristic functions as supervision sources we show that our method generates probabilistic labels with an accuracy matching that of existing off-line methods.
    Aggregating Pairwise Semantic Differences for Few-Shot Claim Veracity Classification. (arXiv:2205.05646v1 [cs.CL])
    As part of an automated fact-checking pipeline, the claim veracity classification task consists in determining if a claim is supported by an associated piece of evidence. The complexity of gathering labelled claim-evidence pairs leads to a scarcity of datasets, particularly when dealing with new domains. In this paper, we introduce SEED, a novel vector-based method to few-shot claim veracity classification that aggregates pairwise semantic differences for claim-evidence pairs. We build on the hypothesis that we can simulate class representative vectors that capture average semantic differences for claim-evidence pairs in a class, which can then be used for classification of new instances. We compare the performance of our method with competitive baselines including fine-tuned BERT/RoBERTa models, as well as the state-of-the-art few-shot veracity classification method that leverages language model perplexity. Experiments conducted on the FEVER and SCIFACT datasets show consistent improvements over competitive baselines in few-shot settings. Our code is available.
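    The vector arithmetic behind this kind of pairwise-difference aggregation is simple; here is a hedged NumPy sketch assuming precomputed claim and evidence embeddings (the actual SEED encoder and similarity measure may differ).

        import numpy as np

        def class_representatives(claim_embs, evid_embs, labels):
            """Average semantic-difference vector per veracity class."""
            diffs = claim_embs - evid_embs          # (N, d) pairwise differences
            return {c: diffs[labels == c].mean(axis=0) for c in np.unique(labels)}

        def classify(claim_emb, evid_emb, reps):
            """Pick the class whose representative is closest to this pair's difference."""
            d = claim_emb - evid_emb
            return min(reps, key=lambda c: np.linalg.norm(d - reps[c]))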
    Generation of non-stationary stochastic fields using Generative Adversarial Networks with limited training data. (arXiv:2205.05469v1 [cs.LG])
    In the context of generating geological facies conditioned on observed data, samples corresponding to all possible conditions are not generally available in the training set, and hence the generation of these realizations depends primarily on the generalization capability of the trained generative model. The problem becomes more complex when applied to non-stationary fields. In this work, we investigate the problem of training Generative Adversarial Network (GAN) models against a dataset of geological channelized patterns that has a few non-stationary spatial modes, and examine the training and self-conditioning settings that improve the generalization capability at new spatial modes that were never seen in the given training set. The developed training method allowed for effective learning of the correlation between the spatial conditions (i.e. non-stationary maps) and the realizations implicitly, without using additional loss terms or solving a costly optimization problem at the realization generation phase. Our models, trained on real and artificial datasets, were able to generate geologically-plausible realizations beyond the training samples, with a strong correlation with the target maps.
    Contrastive Supervised Distillation for Continual Representation Learning. (arXiv:2205.05476v1 [cs.CV])
    In this paper, we propose a novel training procedure for the continual representation learning problem in which a neural network model is sequentially learned to alleviate catastrophic forgetting in visual search tasks. Our method, called Contrastive Supervised Distillation (CSD), reduces feature forgetting while learning discriminative features. This is achieved by leveraging label information in a distillation setting in which the student model is contrastively learned from the teacher model. Extensive experiments show that CSD performs favorably in mitigating catastrophic forgetting by outperforming current state-of-the-art methods. Our results also provide further evidence that feature forgetting evaluated in visual retrieval tasks is not as catastrophic as in classification tasks. Code at: https://github.com/NiccoBiondi/ContrastiveSupervisedDistillation.
    Predicting hot-electron free energies from ground-state data. (arXiv:2205.05591v1 [cond-mat.mtrl-sci])
    Machine-learning potentials are usually trained on the ground-state, Born-Oppenheimer energy surface, which depends exclusively on the atomic positions and not on the simulation temperature. This disregards the effect of thermally-excited electrons, which is important in metals and essential to the description of warm dense matter. An accurate physical description of these effects requires that the nuclei move on a temperature-dependent electronic free energy. We propose a method to obtain machine-learning predictions of this free energy at an arbitrary electron temperature using exclusively training data from ground-state calculations, avoiding the need to train temperature-dependent potentials. We benchmark our method on metallic liquid hydrogen at the conditions of the core of gas giants and brown dwarfs.
    Delayed Reinforcement Learning by Imitation. (arXiv:2205.05569v1 [cs.LG])
    When the agent's observations or interactions are delayed, classic reinforcement learning tools usually fail. In this paper, we propose a simple yet new and efficient solution to this problem. We assume that, in the undelayed environment, an efficient policy is known or can easily be learned, but the task may suffer from delays in practice, and we thus want to take them into account. We present a novel algorithm, Delayed Imitation with Dataset Aggregation (DIDA), which builds upon imitation learning methods to learn how to act in a delayed environment from undelayed demonstrations. We provide a theoretical analysis of the approach that guides the practical design of DIDA. These results are also of general interest in the delayed reinforcement learning literature, as they provide bounds on the performance gap between delayed and undelayed tasks under smoothness conditions. We show empirically that DIDA obtains high performance with remarkable sample efficiency on a variety of tasks, including robotic locomotion, classic control, and trading.
    Efficient Distributed Framework for Collaborative Multi-Agent Reinforcement Learning. (arXiv:2205.05248v1 [cs.AI])
    Multi-agent reinforcement learning for incomplete-information environments has attracted extensive attention from researchers. However, due to slow sample collection and poor sample exploration, there are still some problems in multi-agent reinforcement learning, such as unstable model iteration and low training efficiency. Moreover, most existing distributed frameworks are proposed for single-agent reinforcement learning and are not suitable for the multi-agent setting. In this paper, we design a distributed MARL framework based on the actor-worker-learner architecture. In this framework, multiple asynchronous environment interaction modules can be deployed simultaneously, which greatly improves the sample collection speed and sample diversity. Meanwhile, to make full use of computing resources, we decouple model iteration from environment interaction, and thus accelerate policy iteration. Finally, we verified the effectiveness of the proposed framework in the MaCA military simulation environment and the SMAC 3D real-time strategy gaming environment with incomplete-information characteristics.  ( 2 min )
    What is Proxy Discrimination?. (arXiv:2205.05265v1 [cs.LG])
    The near universal condemnation of proxy discrimination hides a disagreement over what it is. This work surveys various notions of proxy and proxy discrimination found in prior work and represents them in a common framework. These notions variously turn on statistical dependencies, causal effects, and intentions. It discusses the limitations and uses of each notion and of the concept as a whole.
    Access Trends of In-network Cache for Scientific Data. (arXiv:2205.05563v1 [cs.NI])
    Scientific collaborations are increasingly relying on large volumes of data for their work and many of them employ tiered systems to replicate the data to their worldwide user communities. Each user in the community often selects a different subset of data for their analysis tasks; however, members of a research group often are working on related research topics that require similar data objects. Thus, there is a significant amount of data sharing possible. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns and potential for network traffic reduction by this caching system, we aim to explore the predictability of the cache uses and the potential for a more general in-network data caching. Our study shows that this distributed storage cache is able to reduce the network traffic volume by a factor of 2.35 during a part of the study period. We further show that machine learning models could predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caching.  ( 2 min )
    NDGGNET-A Node Independent Gate based Graph Neural Networks. (arXiv:2205.05348v1 [cs.LG])
    Graph Neural Networks (GNNs) are an architecture for structural data that has been adopted in a host of tasks and achieved impressive results, such as link prediction, node classification and graph classification. Generally, for a certain node in a given graph, a traditional GNN layer can be regarded as an aggregation from one-hop neighbors; thus a set of stacked layers is able to fetch and update node status within multiple hops. For nodes with sparse connectivity, it is difficult to obtain enough information through a single GNN layer, as only a few nodes are directly connected to them and high-order neighbor information cannot be propagated. However, as the number of layers increases, the GNN model is prone to over-smoothing for nodes with dense connectivity, which results in decreased accuracy. To tackle this issue, in this thesis, we define a novel framework that allows the normal GNN model to accommodate more layers. Specifically, a node-degree based gate is employed to adjust the weight of layers dynamically, which tries to enhance the information aggregation ability and reduce the probability of over-smoothing. Experimental results show that our proposed model can effectively increase the model depth and perform well on several datasets.  ( 2 min )
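    The abstract does not spell out the gating function, so the following PyTorch fragment is only an assumption-level sketch of a node-degree based gate: low-degree nodes keep more of the new aggregation at deeper layers, while high-degree nodes are damped to limit over-smoothing.

        import torch

        def degree_gate(deg, layer_idx, alpha=1.0):
            # Hypothetical gate: larger for low-degree nodes at deeper layers.
            return torch.sigmoid(alpha * layer_idx / (deg.float() + 1.0))

        def gated_forward(x, layers, deg):
            h = x
            for l, layer in enumerate(layers):          # each layer: a callable GNN block
                g = degree_gate(deg, l).unsqueeze(-1)   # per-node gate in (0, 1)
                h = g * layer(h) + (1.0 - g) * h        # gated residual update
            return h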
    Automated differential equation solver based on the parametric approximation optimization. (arXiv:2205.05383v1 [math.NA])
    The numerical methods for differential equation solution allow obtaining a discrete field that converges towards the solution if the method is applied to the correct problem. Nevertheless, each numerical method has a restricted class of equations on which convergence with a given parameter set or range is proved. Only a few "cheap and dirty" numerical methods converge on a wide class of equations without parameter tuning, at the price of a lower approximation order. The article presents a method that uses an optimization algorithm to obtain a solution using a parameterized approximation. The result may not be as precise as an expert one. However, it allows solving a wide class of equations in an automated manner without changing the algorithm's parameters.  ( 2 min )
  • Open

    Imitation of Manipulation Skills Using Multiple Geometries. (arXiv:2203.01171v2 [cs.RO] UPDATED)
    Daily manipulation tasks are characterized by regular features associated with the task structure, which can be described by multiple geometric primitives related to actions and object shapes. Such geometric descriptors cannot be expressed in Cartesian coordinate systems alone. In this paper, we propose a learning approach to extract the optimal representation from a dictionary of coordinate systems to represent an observed movement. This is achieved by using an extension of Gaussian distributions on Riemannian manifolds, which is used to analyse a set of user demonstrations statistically, by considering multiple geometries as candidate representations of the task. We formulate the reproduction problem as a general optimal control problem based on an iterative linear quadratic regulator (iLQR), where the Gaussian distributions in the extracted coordinate systems are used to define the cost function. We apply our approach to grasping and box opening tasks in simulation and on a 7-axis Franka Emika robot. The results show that the robot can exploit several geometries to execute the manipulation task and generalize it to new situations, by maintaining the invariant features of the skill in the coordinate system(s) of interest.
    Boundary Estimation from Point Clouds: Algorithms, Guarantees and Applications. (arXiv:2111.03217v2 [math.NA] UPDATED)
    We investigate identifying the boundary of a domain from sample points in the domain. We introduce new estimators for the normal vector to the boundary, distance of a point to the boundary, and a test for whether a point lies within a boundary strip. The estimators can be efficiently computed and are more accurate than the ones present in the literature. We provide rigorous error estimates for the estimators. Furthermore we use the detected boundary points to solve boundary-value problems for PDE on point clouds. We prove error estimates for the Laplace and eikonal equations on point clouds. Finally we provide a range of numerical experiments illustrating the performance of our boundary estimators, applications to PDE on point clouds, and tests on image data sets.
    A Model-Free Sampling Method for Estimating Basins of Attraction Using Hybrid Active Learning (HAL). (arXiv:2003.10976v3 [cs.LG] UPDATED)
    Understanding the basins of attraction (BoA) is often a paramount consideration for nonlinear systems. Most existing approaches to determining a high-resolution BoA require prior knowledge of the system's dynamical model (e.g., differential equation or point mapping for continuous systems, cell mapping for discrete systems, etc.), which allows derivation of approximate analytical solutions or parallel computing on a multi-core computer to find the BoA efficiently. However, these methods are typically impractical when the BoA must be determined experimentally or when the system's model is unknown. This paper introduces a model-free sampling method for BoA. The proposed method is based upon hybrid active learning (HAL) and is designed to find and label the "informative" samples, which efficiently determine the boundary of BoA. It consists of three primary parts: 1) additional sampling on trajectories (AST) to maximize the number of samples obtained from each simulation or experiment; 2) an active learning (AL) algorithm to exploit the local boundary of BoA; and 3) a density-based sampling (DBS) method to explore the global boundary of BoA. An example of estimating the BoA for a bistable nonlinear system is presented to show the high efficiency of our HAL sampling method.
    RvS: What is Essential for Offline RL via Supervised Learning?. (arXiv:2112.10751v2 [cs.LG] UPDATED)
    Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. These insights serve as a field guide for practitioners doing Reinforcement Learning via Supervised Learning (which we coin "RvS learning"). They also probe the limits of existing RvS methods, which are comparatively weak on random data, and suggest a number of open problems.  ( 2 min )
    Towards Model Agnostic Federated Learning Using Knowledge Distillation. (arXiv:2110.15210v2 [cs.LG] UPDATED)
    Is it possible to design a universal API for federated learning using which an ad-hoc group of data-holders (agents) can collaborate with each other and perform federated learning? Such an API would necessarily need to be model-agnostic, i.e. make no assumption about the model architecture being used by the agents, and also cannot rely on having representative public data at hand. Knowledge distillation (KD) is the obvious tool of choice to design such protocols. However, surprisingly, we show that most natural KD-based federated learning protocols have poor performance. To investigate this, we propose a new theoretical framework, Federated Kernel ridge regression, which can capture both model heterogeneity as well as data heterogeneity. Our analysis shows that the degradation is largely due to a fundamental limitation of knowledge distillation under data heterogeneity. We further validate our framework by analyzing and designing new protocols based on KD. Their performance on real world experiments using neural networks, though still unsatisfactory, closely matches our theoretical predictions.  ( 2 min )
    Benchmarking Graph Neural Networks. (arXiv:2003.00982v4 [cs.LG] UPDATED)
    In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of May 2022, the GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest in exploring more powerful PE for Transformers and GNNs in a robust experimental setting.  ( 3 min )
    Stochastic Variational Smoothed Model Checking. (arXiv:2205.05398v1 [cs.LG])
    Model-checking for parametric stochastic models can be expressed as checking the satisfaction probability of a certain property as a function of the parameters of the model. Smoothed model checking (smMC) leverages Gaussian Processes (GP) to infer the satisfaction function over the entire parameter space from a limited set of observations obtained via simulation. This approach provides accurate reconstructions with statistically sound quantification of the uncertainty. However, it inherits the scalability issues of GP. In this paper, we exploit recent advances in probabilistic machine learning to push this limitation forward, making Bayesian inference of smMC scalable to larger datasets, enabling its application to larger models in terms of the dimension of the parameter set. We propose Stochastic Variational Smoothed Model Checking (SV-smMC), a solution that exploits stochastic variational inference (SVI) to approximate the posterior distribution of the smMC problem. The strength and flexibility of SVI make SV-smMC applicable to two alternative probabilistic models: Gaussian Processes (GP) and Bayesian Neural Networks (BNN). Moreover, SVI makes inference easily parallelizable and it enables GPU acceleration. In this paper, we compare the performances of smMC against those of SV-smMC by looking at the scalability, the computational efficiency and at the accuracy of the reconstructed satisfaction function.  ( 2 min )
    Externally Valid Treatment Choice. (arXiv:2205.05561v1 [econ.EM])
    We consider the problem of learning treatment (or policy) rules that are externally valid in the sense that they have welfare guarantees in target populations that are similar to, but possibly different from, the experimental population. We allow for shifts in both the distribution of potential outcomes and covariates between the experimental and target populations. This paper makes two main contributions. First, we provide a formal sense in which policies that maximize social welfare in the experimental population remain optimal for the "worst-case" social welfare when the distribution of potential outcomes (but not covariates) shifts. Hence, policy learning methods that have good regret guarantees in the experimental population, such as empirical welfare maximization, are externally valid with respect to a class of shifts in potential outcomes. Second, we develop methods for policy learning that are robust to shifts in the joint distribution of potential outcomes and covariates. Our methods may be used with experimental or observational data.  ( 2 min )
    AutoKE: An automatic knowledge embedding framework for scientific machine learning. (arXiv:2205.05390v1 [cs.LG])
    Imposing physical constraints on neural networks as a method of knowledge embedding has achieved great progress in solving physical problems described by governing equations. However, for many engineering problems, governing equations often have complex forms, including complex partial derivatives or stochastic physical fields, which results in significant inconveniences from the perspective of implementation. In this paper, a scientific machine learning framework, called AutoKE, is proposed, and a reservoir flow problem is taken as an instance to demonstrate that this framework can effectively automate the process of embedding physical knowledge. In AutoKE, an emulator comprised of deep neural networks (DNNs) is built for predicting the physical variables of interest. An arbitrarily complex equation can be parsed and automatically converted into a computational graph through the equation parser module, and the fitness of the emulator to the governing equation is evaluated via automatic differentiation. Furthermore, the fixed weights in the loss function are substituted with adaptive weights by incorporating the Lagrangian dual method. Neural architecture search (NAS) is also introduced into the AutoKE to select an optimal network architecture of the emulator according to the specific problem. Finally, we apply transfer learning to enhance the scalability of the emulator. In experiments, the framework is verified by a series of physical problems in which it can automatically embed physical knowledge into an emulator without heavy hand-coding. The results demonstrate that the emulator can not only make accurate predictions, but also be applied to similar problems with high efficiency via transfer learning.  ( 2 min )
    Probability Distribution of Hypervolume Improvement in Bi-objective Bayesian Optimization. (arXiv:2205.05505v1 [cs.LG])
    This work provides the exact expression of the probability distribution of the hypervolume improvement (HVI) for bi-objective generalization of Bayesian optimization. Here, instead of a single-objective improvement, we consider the improvement of the hypervolume indicator with respect to the current best approximation of the Pareto front. Gaussian process regression models are trained independently on both objective functions, resulting in a bi-variate separated Gaussian distribution serving as a predictive model for the vector-valued objective function. Some common HVI-based acquisition functions (probability of improvement and upper confidence bound) are also leveraged with the help of the exact distribution of HVI. In addition, we show the superior numerical accuracy and efficiency of the exact distribution compared to the commonly used approximation by Monte-Carlo sampling. Finally, we benchmark distribution-leveraged acquisition functions on the widely applied ZDT problem set, demonstrating a significant advantage of using the exact distribution of HVI in multi-objective Bayesian optimization.  ( 2 min )
    Exploring Local Explanations of Nonlinear Models Using Animated Linear Projections. (arXiv:2205.05359v1 [stat.ML])
    The increased predictive power of nonlinear models comes at the cost of interpretability of their terms. This trade-off has led to the emergence of eXplainable AI (XAI). XAI attempts to shed light on how models use predictors to arrive at a prediction with local explanations, a point estimate of the linear feature importance in the vicinity of one instance. These can be considered linear projections and can be further explored to better understand the interactions between features used to make predictions across the predictive model surface. Here we describe interactive linear interpolation used for exploration at any instance, and illustrate with examples with categorical (penguin species, chocolate types) and quantitative (soccer/football salaries, house prices) outputs. The methods are implemented in the R package cheem, available on CRAN.  ( 2 min )
    Variational Autoencoder Leveraged MMSE Channel Estimation. (arXiv:2205.05345v1 [eess.SP])
    We propose to utilize a variational autoencoder (VAE) for data-driven channel estimation. The underlying true and unknown channel distribution is modeled by the VAE as a conditional Gaussian distribution in a novel way, parameterized by the respective first and second order conditional moments. As a result, it can be observed that the linear minimum mean square error (LMMSE) estimator in its variant conditioned on the latent sample of the VAE approximates an optimal MSE estimator. Furthermore, we argue how a VAE-based channel estimator can approximate the MMSE channel estimator. We propose three variants of VAE estimators that differ in the data used during training and estimation. First, we show that given perfectly known channel state information at the input of the VAE during estimation, which is impractical, we obtain an estimator that can serve as a benchmark result for an estimation scenario. We then propose practically feasible approaches, where perfectly known channel state information is only necessary in the training phase or is not needed at all. Simulation results on 3GPP and QuaDRiGa channel data attest a small performance loss of the practical approaches and the superiority of our VAE approaches in comparison to other related channel estimation methods.  ( 2 min )
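    For reference, with a linear observation model y = A h + n, n ~ CN(0, sigma^2 I) (an assumption here, not necessarily the paper's exact setup), the LMMSE estimator conditioned on the VAE latent sample z uses the conditional moments mu(z) and C(z) supplied by the decoder:

        \hat{\mathbf{h}}(\mathbf{z}) = \boldsymbol{\mu}(\mathbf{z})
          + \mathbf{C}(\mathbf{z})\,\mathbf{A}^{\mathrm{H}}
            \bigl(\mathbf{A}\,\mathbf{C}(\mathbf{z})\,\mathbf{A}^{\mathrm{H}} + \sigma^{2}\mathbf{I}\bigr)^{-1}
            \bigl(\mathbf{y} - \mathbf{A}\,\boldsymbol{\mu}(\mathbf{z})\bigr)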
    Learning Multitask Gaussian Bayesian Networks. (arXiv:2205.05343v1 [stat.ML])
    Major depressive disorder (MDD) requires study of brain functional connectivity alterations for patients, which can be uncovered by resting-state functional magnetic resonance imaging (rs-fMRI) data. We consider the problem of identifying alterations of brain functional connectivity for a single MDD patient. This is particularly difficult since the amount of data collected during an fMRI scan is too limited to provide sufficient information for individual analysis. Additionally, rs-fMRI data usually has the characteristics of incompleteness, sparsity, variability, high dimensionality and high noise. To address these problems, we propose a multitask Gaussian Bayesian network (MTGBN) framework capable of identifying individual disease-induced alterations for MDD patients. We assume that such disease-induced alterations show some degree of similarity across patients, which allows the network structures to be learned jointly from related tasks. First, we treat each patient in a class of observations as a task and then learn the Gaussian Bayesian networks (GBNs) of this data class by learning from all tasks that share a default covariance matrix encoding prior knowledge. This setting helps us learn more information from limited data. Next, we derive a closed-form formula of the complete likelihood function and use the Monte-Carlo Expectation-Maximization (MCEM) algorithm to search for the approximately best Bayesian network structures efficiently. Finally, we assess the performance of our methods with simulated and real-world rs-fMRI data.  ( 2 min )
    Analysis of convolutional neural network image classifiers in a rotationally symmetric model. (arXiv:2205.05500v1 [stat.ML])
    Convolutional neural network image classifiers are defined and the rate of convergence of the misclassification risk of the estimates towards the optimal misclassification risk is analyzed. Here we consider images as random variables with values in some functional space, where we only observe discrete samples as function values on some finite grid. Under suitable structural and smoothness assumptions on the functional a posteriori probability, which includes some kind of symmetry against rotation of subparts of the input image, it is shown that least squares plug-in classifiers based on convolutional neural networks are able to circumvent the curse of dimensionality in binary image classification if we neglect a resolution-dependent error term. The finite sample size behavior of the classifier is analyzed by applying it to simulated and real data.  ( 2 min )
    On Distributed Adaptive Optimization with Gradient Compression. (arXiv:2205.05632v1 [stat.ML])
    We study COMP-AMS, a distributed optimization framework based on gradient averaging and adaptive AMSGrad algorithm. Gradient compression with error feedback is applied to reduce the communication cost in the gradient transmission process. Our convergence analysis of COMP-AMS shows that such compressed gradient averaging strategy yields same convergence rate as standard AMSGrad, and also exhibits the linear speedup effect w.r.t. the number of local workers. Compared with recently proposed protocols on distributed adaptive methods, COMP-AMS is simple and convenient. Numerical experiments are conducted to justify the theoretical findings, and demonstrate that the proposed method can achieve same test accuracy as the full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods.  ( 2 min )
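    The compression-with-error-feedback mechanism that COMP-AMS builds on is easy to sketch; below is a minimal PyTorch version with top-k sparsification as the (assumed) compressor. The server-side AMSGrad integration from the paper is not reproduced here.

        import torch

        def topk_compress(g, k):
            flat = g.flatten()
            idx = flat.abs().topk(k).indices
            out = torch.zeros_like(flat)
            out[idx] = flat[idx]                    # keep only the k largest entries
            return out.view_as(g)

        class ErrorFeedback:
            """Per-worker error feedback: re-inject what compression dropped."""
            def __init__(self):
                self.residual = None

            def step(self, grad, k):
                if self.residual is None:
                    self.residual = torch.zeros_like(grad)
                corrected = grad + self.residual        # add back the carried error
                compressed = topk_compress(corrected, k)
                self.residual = corrected - compressed  # store the new compression error
                return compressed                       # this is what gets transmitted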
    Stable and Interpretable Unrolled Dictionary Learning. (arXiv:2106.00058v4 [cs.LG] UPDATED)
    The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a ``good'' initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the first theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent sparse representation. Additionally, we address two challenges; first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability. This is achieved by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE's interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between network weights, its output, and the training set.  ( 2 min )
    A Framework for Machine Learning of Model Error in Dynamical Systems. (arXiv:2107.06658v2 [math.DS] UPDATED)
    The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from noisily and partially observed data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. Our formulation is agnostic to the chosen machine learning model, is presented in both continuous- and discrete-time settings, and is compatible both with model errors that exhibit substantial memory and errors that are memoryless. First, we study memoryless linear (w.r.t. parametric-dependence) model error from a learning theory perspective, defining excess risk and generalization error. For ergodic continuous-time systems, we prove that both excess risk and generalization error are bounded above by terms that diminish with the square-root of T, the time-interval over which training data is specified. Secondly, we study scenarios that benefit from modeling with memory, proving universal approximation theorems for two classes of continuous-time recurrent neural networks (RNNs): both can learn memory-dependent model error. In addition, we connect one class of RNNs to reservoir computing, thereby relating learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features. Numerical results are presented (Lorenz '63, Lorenz '96 Multiscale systems) to compare purely data-driven and hybrid approaches, finding hybrid methods less data-hungry and more parametrically efficient. Finally, we demonstrate numerically how data assimilation can be leveraged to learn hidden dynamics from noisy, partially-observed data, and illustrate challenges in representing memory by this approach, and in the training of such models.  ( 2 min )
    AutoTransfer: Subject Transfer Learning with Censored Representations on Biosignals Data. (arXiv:2112.09796v2 [cs.LG] UPDATED)
    We provide a regularization framework for subject transfer learning in which we seek to train an encoder and classifier to minimize classification loss, subject to a penalty measuring independence between the latent representation and the subject label. We introduce three notions of independence and corresponding penalty terms using mutual information or divergence as a proxy for independence. For each penalty term, we provide several concrete estimation algorithms, using analytic methods as well as neural critic functions. We provide a hands-off strategy for applying this diverse family of regularization algorithms to a new dataset, which we call "AutoTransfer". We evaluate the performance of these individual regularization strategies and our AutoTransfer method on EEG, EMG, and ECoG datasets, showing that these approaches can improve subject transfer learning for challenging real-world datasets.  ( 2 min )
    Contextual Search in the Presence of Adversarial Corruptions. (arXiv:2002.11650v5 [cs.LG] UPDATED)
    We study contextual search, a generalization of binary search in higher dimensions, which captures settings such as feature-based dynamic pricing. Standard formulations of this problem assume that agents act in accordance with a specific homogeneous response model. In practice, however, some responses may be adversarially corrupted. Existing algorithms heavily depend on the assumed response model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrary misspecifications. We initiate the study of contextual search when some of the agents can behave in ways inconsistent with the underlying response model. In particular, we provide two algorithms, one based on multidimensional binary search methods and one based on gradient descent. We show that these algorithms attain near-optimal regret in the absence of adversarial corruptions and their performance degrades gracefully with the number of such agents, providing the first results for contextual search in any adversarial noise model. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis.  ( 2 min )
    Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise. (arXiv:2102.04297v4 [cs.LG] UPDATED)
    The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in Şimşekli (2019a,b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds a "flatter" local minima and achieves better generalization.  ( 2 min )
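    The SGD variant analyzed here is gradient truncation at a fixed threshold; a one-function PyTorch sketch (the threshold b and the global-norm formulation are illustrative choices) is:

        import torch

        def truncated_sgd_step(params, lr, b):
            with torch.no_grad():
                gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
                scale = torch.clamp(b / (gnorm + 1e-12), max=1.0)  # truncate above b
                for p in params:
                    p -= lr * scale * p.grad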
    DoubleMatch: Improving Semi-Supervised Learning with Self-Supervision. (arXiv:2205.05575v1 [cs.LG])
    Following the success of supervised learning, semi-supervised learning (SSL) is now becoming increasingly popular. SSL is a family of methods, which in addition to a labeled training set, also use a sizable collection of unlabeled data for fitting a model. Most of the recent successful SSL methods are based on pseudo-labeling approaches: letting confident model predictions act as training labels. While these methods have shown impressive results on many benchmark datasets, a drawback of this approach is that not all unlabeled data are used during training. We propose a new SSL algorithm, DoubleMatch, which combines the pseudo-labeling technique with a self-supervised loss, enabling the model to utilize all unlabeled data in the training process. We show that this method achieves state-of-the-art accuracies on multiple benchmark datasets while also reducing training times compared to existing SSL methods. Code is available at https://github.com/walline/doublematch.  ( 2 min )
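    A hedged sketch of the combined objective follows: a confidence-thresholded pseudo-label term on strongly augmented views plus a self-supervised consistency term over all unlabeled data. The weight w_ssl, the threshold tau, and the cosine-similarity head are illustrative assumptions, not necessarily the paper's exact choices.

        import torch
        import torch.nn.functional as F

        def doublematch_style_loss(logits_weak, logits_strong,
                                   feats_weak, feats_strong,
                                   tau=0.95, w_ssl=1.0):
            probs = logits_weak.softmax(dim=-1).detach()
            conf, pseudo = probs.max(dim=-1)
            mask = (conf >= tau).float()            # confident samples only
            l_pseudo = (F.cross_entropy(logits_strong, pseudo,
                                        reduction="none") * mask).mean()
            # Self-supervised term: uses *all* unlabeled data, confident or not.
            l_ssl = 1.0 - F.cosine_similarity(feats_strong,
                                              feats_weak.detach()).mean()
            return l_pseudo + w_ssl * l_ssl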
    A Unified f-divergence Framework Generalizing VAE and GAN. (arXiv:2205.05214v1 [stat.ML])
    Developing deep generative models that flexibly incorporate diverse measures of probability distance is an important area of research. Here we develop a unified mathematical framework of f-divergence generative models, f-GM, that incorporates both VAE and f-GAN, and enables tractable learning with general f-divergences. f-GM allows the experimenter to flexibly design the f-divergence function without changing the structure of the networks or the learning procedure. f-GM jointly models three components: a generator, an inference network and a density estimator. Therefore it simultaneously enables sampling, posterior inference of the latent variable, as well as evaluation of the likelihood of an arbitrary datum. f-GM belongs to the class of encoder-decoder GANs: our density estimator can be interpreted as playing the role of a discriminator between samples in the joint space of latent code and observed space. We prove that f-GM naturally simplifies to the standard VAE and to f-GAN as special cases, and illustrate the connections between different encoder-decoder GAN architectures. f-GM is compatible with general network architectures and optimizers. We leverage it to experimentally explore the effects -- e.g. mode collapse and image sharpness -- of different choices of f-divergence.  ( 2 min )
    Regression-based projection for learning Mori--Zwanzig operators. (arXiv:2205.05135v1 [math.DS])
    We propose to adopt statistical regression as the projection operator to enable data-driven learning of the operators in the Mori--Zwanzig formalism. We present a principled algorithm to extract the Markov and memory operators for any regression models. We show that the choice of linear regression results in a recently proposed data-driven learning algorithm based on Mori's projection operator, which can be considered as a higher-order approximate Koopman learning method. We show that more expressive, potentially nonlinear regression models naturally fill in the gap between the highly idealized and computationally efficient Mori's projection operator and the most optimal yet computationally infeasible Zwanzig projection operator. We performed numerical experiments and extracted the operators for an array of regression-based projections, including linear, polynomial, spline, and neural-network-based regression, showing a progressive improvement as the complexity of the regression model increased. Our proposition provides a general framework to extract memory-dependent corrections and can be readily applied to an array of data-driven learning methods for stationary dynamical systems in the literature.  ( 2 min )
    Bias and Priors in Machine Learning Calibrations for High Energy Physics. (arXiv:2205.05084v1 [hep-ph])
    Machine learning offers an exciting opportunity to improve the calibration of nearly all reconstructed objects in high-energy physics detectors. However, machine learning approaches often depend on the spectra of examples used during training, an issue known as prior dependence. This is an undesirable property of a calibration, which needs to be applicable in a variety of environments. The purpose of this paper is to explicitly highlight the prior dependence of some machine learning-based calibration strategies. We demonstrate how some recent proposals for both simulation-based and data-based calibrations inherit properties of the sample used for training, which can result in biases for downstream analyses. In the case of simulation-based calibration, we argue that our recently proposed Gaussian Ansatz approach can avoid some of the pitfalls of prior dependence, whereas prior-independent data-based calibration remains an open problem.  ( 2 min )
    Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation. (arXiv:2205.05330v1 [cs.SD])
    This paper describes heavy-tailed extensions of a state-of-the-art versatile blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) from a unified point of view. The common way of deriving such an extension is to replace the multivariate complex Gaussian distribution in the likelihood function with its heavy-tailed generalization, e.g., the multivariate complex Student's t and leptokurtic generalized Gaussian distributions, and tailor-make the corresponding parameter optimization algorithm. Using a wider class of heavy-tailed distributions called a Gaussian scale mixture (GSM), i.e., a mixture of Gaussian distributions whose variances are perturbed by positive random scalars called impulse variables, we propose GSM-FastMNMF and develop an expectation-maximization algorithm that works even when the probability density function of the impulse variables has no analytical expression. We show that existing heavy-tailed FastMNMF extensions are instances of GSM-FastMNMF and derive a new instance based on the generalized hyperbolic distribution that includes the normal-inverse Gaussian, Student's t, and Gaussian distributions as special cases. Our experiments show that the normal-inverse Gaussian FastMNMF outperforms the state-of-the-art FastMNMF extensions and the ILRMA model in speech enhancement and separation in terms of the signal-to-distortion ratio.  ( 2 min )

  • Open

    MindsEye Lite: run multiple text-to-image models in one place
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Terraform Provider Iterative (TPI) Helps ML/AI Teams Save Resources And Money While Switching Cloud Providers
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    What is the best type of learning for an AI to play a card game?
    I'm doing my project and I'm not sure what kind of learning is best for this kind of game. What do you think? submitted by /u/Merelex_Buk-12 [link] [comments]
    Aimlflow - Aim-based experiment comparison over your MLflow logs
    🙌 Hey all, check out our work on AiMLflow! We are building a tool that mounts the MLflow logs and enables an Aim-based super-performant UI for metric, image and other ML metadata comparison. If you are using MLflow, we would love to chat with you and share our progress on the project for feedback. https://aimstack.io/aimlflow submitted by /u/ManeSa [link] [comments]  ( 1 min )
    This UBIAI article and tutorial explains the predictions of a named entity recognition (NER) model using local interpretable model-agnostic explanations (LIME). Check more articles and tutorials at: ubiai.tools
    submitted by /u/UBIAI [link] [comments]  ( 1 min )
    Use Of AI To Reduce Street Crimes And Improving Civil Security
    In this digital era, several law enforcement agencies across the globe are leveraging artificial intelligence (AI) to resolve more criminal cases in a very short time, equipped with AI algorithms developed to identify, locate and arrest real or potential criminals faster than ever. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Is AI eco-friendly?
    Artificial Intelligence (AI) systems can be used to reduce carbon emissions and eventually environmental degradation. It has the potential to understand the surroundings autonomously, learn from experience, make and implement decisions, and communicate with the surroundings. Read more submitted by /u/ridamughal110 [link] [comments]
    Interview with a computer scientist who works on AI & social/crowd computing. We speak about her career & life. If you are interested please subscribe as I'll be posting more AI interviews soon! :)
    submitted by /u/joemurray1994 [link] [comments]  ( 1 min )
    What is the difference between face detection and face recognition from an iOS perspective?
    submitted by /u/adilonreddit1 [link] [comments]  ( 1 min )
    Awesome Artificial Intelligence content (with code!) on Computer Vision News of May 2022
    Dear all, Awesome AI R&D content (with code!) on Computer Vision News of May 2022. Many great articles about AI, Deep Learning, Computer Vision and more... Review of award-winning ISBI 2022 paper. HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 60. Enjoy! submitted by /u/Gletta [link] [comments]  ( 1 min )
    AR/VR: The Next Frontier in Banking and Financial Services
    https://blog.r2c.io/ar-vr-the-next-frontier-in-banking-and-financial-services/ submitted by /u/R2Consulting [link] [comments]
    Some results look interesting!
    submitted by /u/limapedro [link] [comments]
    100+ Cheat Sheets for Machine Learning, Deep Learning and Data Science
    submitted by /u/vadhavaniyafaijan [link] [comments]
    Speech Recognition with TensorFlow.js - Voice Commands
    submitted by /u/RubiksCodeNMZ [link] [comments]
    The results of the AI experiment/survey I conducted on this sub a short time ago are here (link to the full study in the comment)
    submitted by /u/KazRainer [link] [comments]  ( 1 min )
    [P] PaddleOCR: An awesome and easy-to-use OCR system, which provides more than 80 kinds of multi-language recognition models.
    Hi, all, I am glad to share an open source repository, PaddleOCR, which provides more than 80 kinds of multi-language recognition models, including English, Chinese, French, German, Arabic, Korean, Japanese and so on. Code: https://github.com/PaddlePaddle/PaddleOCR Feature set: ultra-lightweight OCR system, detection (3.6M) + direction classifier (1.4M) + recognition (12M) = 17.0M; more than 80 kinds of multi-language recognition models, including English, Chinese, French, German, Arabic, Korean, Japanese and so on; semi-automatic data annotation tool PPOCRLabel, supporting rectangular box, irregular text, table and key information annotation modes; data synthesis tool Style-Text, making it easy to synthesize a large number of images similar to the target scene image; PIP installation, easy to use; support for Linux, Windows, MacOS and other systems; Apache-2.0 license. submitted by /u/Aha_IamDaniel [link] [comments]  ( 1 min )
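    For the curious, basic usage follows the project README roughly like this (the result structure can vary between PaddleOCR versions):

        # pip install paddleocr paddlepaddle
        from paddleocr import PaddleOCR

        ocr = PaddleOCR(use_angle_cls=True, lang="en")  # detection + classifier + recognition
        result = ocr.ocr("example.png", cls=True)
        for box, (text, score) in result[0]:            # per-page list in recent versions
            print(text, score)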
    MELODIES POSITIVE: Bottle Art.
    submitted by /u/cookingandcraft [link] [comments]
    Self-hosted 'replika-like' chatbot using VA Framework and Oryzer Studio
    I purchased VA Framework Standard Edition at a great price during the Steam Holiday sale at Christmas and played around with it a bit when I first got it, but it has been sitting idle since then and I decided to start messing with it again. What attracted me to this was being combined with Oryzer Studio and its flow-based programming and already being attached to a 3D avatar interface with speech engine. What I am trying to do is create a personal self-hosted 'replika-like' chatbot that can be trained and learn from my input. Based on my research and communication with the developer prior to purchasing this, what I am hoping to accomplish should easily be obtainable using this. The problem is that the documentation and sample projects available are very sparse and what is available seems to be geared more towards corporate customer systems rather than personal systems. I have limited programming experience, mostly Visual Basic from when I was in college, which is why Oryzer studio seemed perfect for me. What I am hoping is that there is someone else who has experience with it and may be able to assist with getting me started or at least point me in the direction of some documentation and/or sample projects, other than the ones provided on the developer's website, that I can build off of. VA Framework, which I purchased, is the 3D Avatar and speech engine and Oryzer Studio, which is free, is used to create the 'workspace' files used by VA Framework. Both are designed to work together, although Oryzer Studio does not require VA Framework to function and is freely available to anyone who wanted to look at it. I thank you in advance for any assistance you can provide submitted by /u/Normmstein [link] [comments]  ( 1 min )
    How to Make Slow Motion Videos With AI ! TimeLens Explained
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    [D] What is the best machine learning algorithm to classify seemingly random data?
    I am a Computer Science student. For my bachelor's internship I have to research the possibility of predicting whether a new client will buy a house or not based on their website activity/behaviour, so the real-estate company can prioritise the most likely buyers. The real-estate company assured me they had loads of data, but nope. I have data on about 1400 clients, of which 150 bought a house. When we learn about new algorithms we usually have a perfect dataset to work with, but this is the real world where everything is chaos. I've tried KNN, SVM, Gaussian mixture and XGBoost. But they all underperform because there is very little data and all of it is seemingly random. So what is the best machine learning algorithm to classify seemingly random data? Or is it impossible to get any reasonable prediction out of it? For my research it is perfectly acceptable to conclude that it is impossible to predict anything given these circumstances. Thanks in advance! submitted by /u/Fine-Coffee6171 [link] [comments]  ( 2 min )
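    Not from the thread, but one commonly suggested baseline for this setup (roughly 150 positives out of 1400) is class-weighted XGBoost evaluated with stratified cross-validated AUC; an AUC near 0.5 would support the "not predictable from this data" conclusion. X and y below are placeholders for the real features and labels.

        import numpy as np
        from xgboost import XGBClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1400, 20))                   # placeholder features
        y = (rng.random(1400) < 150 / 1400).astype(int)   # ~150 positives

        model = XGBClassifier(
            scale_pos_weight=(1400 - 150) / 150,  # up-weight the rare buyers
            eval_metric="auc",
        )
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(scores.mean())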
    [D] Any recommendations for image annotation software?
    I've been using Supervisely Enterprise at work and it's been going really well, but my work has been paying for it. Meanwhile, my grad research requires a large-scale annotation effort and my advisor is queasy about forking over grant money just to annotate. We can't use the community version because of HIPAA. I've been looking at other free annotation software and I'm trying to make a decision on which to use. Any suggestions on what has worked for you? submitted by /u/aygupt1822 [link] [comments]  ( 1 min )
    [D] I'm writing an MSc thesis on synthetic data for image classification. I am wondering which kinds of data visualization techniques (of real data)/EDA I could add.
    The task is recognition of jersey numbers on the backs of sports players. The dataset contains the picture, the label and a bounding box containing the jersey number. The only ideas I had were plotting the distribution of the bounding boxes in the image, and the histogram of labels. I am not sure if it makes sense to plot the distribution of image resolutions. submitted by /u/TheManveru [link] [comments]  ( 2 min )
    [P] Nftopia.ai - Visual semantic search for NFTs
    Hi! My colleagues & I indexed about 1.1 million NFT images from across 21 thousand collections using OpenAI's CLIP and put it behind a website. Please let us know what you think! Specifically, we embed all these images using clip & expose two functionalities. The search bar embeds the input query & retrieves similar images. The "+more like this" is image search that retrieves other NFTs similar to it in the image space. We use Pinecone to power the approximate nearest neighbor search. The web stack uses Django in the front + flask app that interfaces with Pinecone. (Warning: we've noticed several NSFW NFTs that can pop up occasionally even for unrelated search queries) submitted by /u/hayAbhay [link] [comments]  ( 1 min )
    [D] Can you recommend this book? Analytical Skills for AI and Data Science
    I keep stumbling across this book: https://www.oreilly.com/library/view/analytical-skills-for/9781492060932/. I am thinking about buying it. Are there people here who own this book and can give some recommendations? submitted by /u/rightkill [link] [comments]  ( 1 min )
    [P] Recognize similar nodes in chronologically evolving graph
I’ve collected a data set of around 5 million unique IDs which transact values with one another, directionally, at particular points in time. This can naturally be visualized as a graph where each node is an ID and each interaction is a directed edge weighted by a tuple (value, time). I now want to build a model that finds similarly behaving nodes. The goal is to look at the current state of the graph and predict what is likely to happen with nodes that joined the graph relatively late (i.e., whose first transaction is recent). I'm sure there is prior work that could give me hints; are there papers/models you would recommend looking at? A simple baseline is sketched below. submitted by /u/SomewhereOld6859 [link] [comments]  ( 1 min )
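Before reaching for graph-embedding papers, one simple baseline (a sketch; the edge fields are hypothetical) is to summarise each node's transaction behaviour as a small feature vector and run a nearest-neighbour search over those vectors:

```python
# Sketch: per-node behavioural features + nearest-neighbour similarity search.
from collections import defaultdict
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def node_features(edges, now):
    """edges: iterable of (src, dst, value, timestamp) tuples (hypothetical schema)."""
    f = defaultdict(lambda: np.zeros(5))  # [out_deg, in_deg, out_value, in_value, last_seen]
    for src, dst, value, t in edges:
        f[src][[0, 2]] += [1, value]
        f[dst][[1, 3]] += [1, value]
        f[src][4] = max(f[src][4], t)
        f[dst][4] = max(f[dst][4], t)
    ids = list(f)
    X = np.stack([f[i] for i in ids])
    X[:, 4] = now - X[:, 4]               # recency: time since last transaction
    return ids, StandardScaler().fit_transform(np.log1p(X))  # tame heavy tails

edges = [("a", "b", 10.0, 1), ("b", "c", 5.0, 2), ("a", "c", 7.0, 3)]
ids, X = node_features(edges, now=10)
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)                 # idx[i][1]: most similar other node to ids[i]
```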
    [R] [CVPR 2022 Oral] Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation
    We present an end-to-end deep view aggregation method for 3D semantic segmentation from images and point clouds. We reach SOTA on S3DIS and KITTI360 without requiring point cloud colorization, meshing, or depth sensors: just point clouds, images, and their poses. preprint | code | paperwithcode submitted by /u/grad_student_descent [link] [comments]  ( 1 min )
    [P] I wrote an article on Uplift Modeling to target the "persuadables"
    Hey everyone! As the title says, I recently wrote a technical article where I built an uplift model to increase marketing ROIs by targeting the right group of people (the persuadables). Here's the link: https://towardsdatascience.com/targeting-the-right-group-with-uplift-modelling-5682de2dff8b Curious to know if anyone has successfully used uplift modeling in their industry or field? And let me know what you think. Any feedback is greatly appreciated! submitted by /u/ramanansubramanian [link] [comments]  ( 1 min )
    [D] Is there a tool to visualise my neural network in real time?
I would like to know if there are any tools I can use to visualise my NN in real time, noting that I use Keras. I want something like what is shown in this video screenshot: https://preview.redd.it/yut97aa7ysy81.png?width=1969&format=png&auto=webp&s=6704667a483810d1614c90075335faa75e1ec0e7 submitted by /u/aymenboufe [link] [comments]  ( 1 min )
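Not the tool in the screenshot, but the closest standard option for Keras is TensorBoard, which can show the model graph and live-updating weight/activation histograms while training runs. A minimal sketch:

```python
# Sketch: Keras TensorBoard callback; run "tensorboard --logdir logs" alongside
# training to watch the graph and histograms update in the browser.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

tb = tf.keras.callbacks.TensorBoard(
    log_dir="logs",
    histogram_freq=1,    # weight/activation histograms every epoch
    write_graph=True,    # network structure, browsable in the Graphs tab
)
# model.fit(X, y, epochs=10, callbacks=[tb])  # X, y: your training data
```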
    [D] Is there any way to artificially create a probability calibration for data coming from another model?
I have probability predictions which come from a survival model. The model gives me very low probabilities, and I am not sure they reflect the real probability of the phenomenon. For example, I calculate P(T ≤ t+d | T > t) with d = 180, and the probabilities are very low. To summarize, I need these probabilities to average out to another number (let's say 0.2). Is it possible to create an artificial calibration with only this number (the desired average) as input? I have thought of creating a vector of size n equal to the size of the original dataset, with distribution X_i ~ Bernoulli(p = 0.2), and assigning its ones to the top np probabilities and its zeros to the remaining n(1−p). This would result in a table with a column of probabilities obtained with the survival model and another column with a 0 or 1 depending on that probability. After getting this table, I would simply use CalibratedClassifierCV from scikit-learn. Could this be the correct way? submitted by /u/Jhones_edlc [link] [comments]  ( 1 min )
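A sketch of that exact construction (with a placeholder for the survival-model outputs). Since CalibratedClassifierCV expects an estimator rather than precomputed scores, isotonic regression is used here as the equivalent monotone calibration map. Note that the pseudo-labels contain only the rank order plus the assumed 0.2 base rate, no real outcome information, so this forces the average rather than discovering it.

```python
# Sketch: rank-based Bernoulli pseudo-labels + isotonic calibration map.
import numpy as np
from sklearn.isotonic import IsotonicRegression

probs = np.random.beta(1, 40, size=5000)   # placeholder for survival-model outputs
p_target = 0.2

n = len(probs)
k = int(round(n * p_target))
pseudo = np.zeros(n)
pseudo[np.argsort(-probs)[:k]] = 1.0       # ones on the top n*p ranked probabilities

iso = IsotonicRegression(out_of_bounds="clip")   # monotone map, preserves ranking
calibrated = iso.fit_transform(probs, pseudo)
print(probs.mean(), calibrated.mean())     # calibrated mean ~= 0.2 by construction
```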
    [D] Fastest way to calculate distance (drift) between vectors - at scale (billions)
I'm looking for the fastest way to calculate distance (drift) between vectors at scale (billions). What do I mean? For example, take two sets of 1 billion vectors each; I want to know how far/different one set is from the other. Each vector has 768 dimensions. submitted by /u/igaloly [link] [comments]  ( 2 min )
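One cheap first answer (a sketch under the assumption that the vectors live in flat float32 files, with hypothetical file names): compare summary statistics rather than pairwise distances. A single pass per set computes each set's mean embedding in O(d) memory; the distance between the two centroids is a coarse but very fast drift measure, which kernel statistics such as MMD can refine if needed.

```python
# Sketch: streaming mean embedding per set, then cosine distance of centroids.
import numpy as np

def streaming_mean(path, n, d=768, batch=1_000_000):
    vecs = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
    acc = np.zeros(d, dtype=np.float64)
    for i in range(0, n, batch):
        acc += vecs[i:i + batch].sum(axis=0, dtype=np.float64)  # one chunk in RAM
    return acc / n

mu_a = streaming_mean("set_a.f32", n=1_000_000_000)  # hypothetical file names
mu_b = streaming_mean("set_b.f32", n=1_000_000_000)
cos_drift = 1 - mu_a @ mu_b / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b))
print(cos_drift)  # 0 = identical direction, larger = more drift
```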
    [R][D] Cluster Sentences
[R][D] Cluster Sentences I have a use-case where I need to cluster/combine related sentences together. (The original post attached two example images: a list of input sentences, and the same sentences grouped by topic.) Constraint: we do not know the number of clusters beforehand, so K-means clustering is not much use here. The following does not give a satisfactory solution: https://ntropy.com/post/clustering-millions-of-sentences-to-optimize-the-ml-workflow Can anyone please help with an approach/pointers? One threshold-based sketch is below. submitted by /u/Expert-Departure-236 [link] [comments]
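A minimal sketch, assuming sentence-transformers for embeddings: agglomerative clustering with a distance threshold does not need the number of clusters in advance.

```python
# Sketch: sentence embeddings + threshold-based agglomerative clustering (no K).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

sentences = ["refund not received", "where is my refund", "app crashes on login"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences, normalize_embeddings=True)

clust = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.4,   # tune on a held-out sample; smaller = tighter clusters
    metric="cosine",          # this keyword is "affinity" on scikit-learn < 1.2
    linkage="average",
)
labels = clust.fit_predict(emb)
print(labels)  # e.g. [0, 0, 1]: the two refund sentences grouped together
```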
    [R] Full-batch GD generalizes better than SGD
This paper (https://arxiv.org/abs/2204.12446) bounds the generalization error in terms of the optimization error. For smooth losses (examples: log-sum-exp, or smoothed leaky-ReLU activations), full-batch GD generalizes better than SGD (or at least better than SGD's known generalization error bounds). Additionally, in the over-parametrized regime (exact fit) the Polyak-Lojasiewicz condition holds (https://arxiv.org/abs/2003.00307), and T = C log(n) iterations suffice to achieve generalization and excess risk of order 1/n^2. In practice this translates to "train longer to generalize better" and "increase the batch size to generalize better". (The original post embedded two tables of results from the paper.) submitted by /u/chaotic_shadow4444 [link] [comments]  ( 4 min )
    [P] Music mixing and AI
I was wondering what would be the best approach to developing an AI agent that automatically mixes music, i.e., selects tracks and applies mixing effects like an artificial DJ? submitted by /u/big_dataFitness [link] [comments]  ( 1 min )
  • Open

    Animo Island, a PC game that empowers players to explore reinforcement learning as a game mechanic
Transitional Forms presents Animo Island, a rogue-lite base builder that uses the power of reinforcement learning to let YOU explore complex behaviours by training cute little agents called Animo! Through the process of rewarding and training your Animo, you can teach it to perform actions on the island, from which it learns to make better decisions over time. You can customize and train your Animo to gather crystals, pick apples, plant flowers... the sky's the limit as you explore new terrain and emergent behaviours! Animo Island is teeming with potential for you to create an ever-changing and autonomous paradise for your Animos to cultivate and grow. Through cute, loveable characters and playful environments, Animo Island invites you to create empathetic connections with creative machine intelligence in a way that has never been done before. Join our Discord community to learn more about Animo Island and get all the latest updates and information! We are currently looking for passionate folks to be play testers for our upcoming beta release; make sure to drop us a line in Discord if you're interested! Link to Discord: https://discord.gg/rkSXhKyDVR Subreddit: r/AnimoIsland submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
Basic question about importance sampling for off-policy Monte Carlo prediction.
    submitted by /u/jamespherman [link] [comments]  ( 1 min )
How to set up a Gym environment for a two-player game like Nine Men's Morris
I'm working on an RL agent which plays the game Nine Men's Morris against a human player (and hopefully wins most of the time). At least that is the goal. Right now I'm struggling with the game logic itself and with setting up an appropriate environment in gym. Question 1: If an action closes a mill, the same player has to take another action (remove one of the opponent's pieces) - how do I set this up in the step() function? Currently I have two ideas: require another action from the agent inside the step function, or change the action space so that every action contains an additional parameter saying where to take a piece in case of a mill. The latter seems cleaner but also increases my action space further... (a skeleton of the first idea is sketched below). Question 2: The observation is a 7 x 7 board where 1 marks player 1, -1 marks player 2 and zero is an empty field. There are 600 possible actions (place a piece on one of 24 legal points, take a piece from one of 24 legal points, or move a piece: 24 * 24 - 24 options). Is there any way to reduce the action space? Is 600 too large? Should my agent know which actions are legal, or should it learn this? Question 3: My current understanding of gym is that the entire game logic sits behind the step(action) function. If the agent chooses an illegal action, the game is immediately terminated and the agent receives the losing reward. Is this correct? Question 4: My first idea for training the agent is to invert the player board after each turn, so the agent plays against itself. I've read that this is probably unstable and won't produce great results, but as a first step it would probably be enough. Another idea is to play against an older version of the agent so that it gets better over time. Do these ideas seem reasonable? Furthermore, any resource (blog article, paper, tutorial...) on this topic (using RL for two-player games) would be greatly appreciated. submitted by /u/house_92 [link] [comments]  ( 2 min )
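A sketch of the first idea from Question 1: keep the flat Discrete(600) action space and add a "removal phase" flag, so that after a mill closes, the same player's next step() call is interpreted as a removal. Illegal actions end the episode with a penalty, per Question 3 (action masking is the main alternative). The game-logic helpers are stubs.

```python
# Sketch: mill closure triggers a removal phase for the same player.
import gym
import numpy as np
from gym import spaces

class NineMensMorrisEnv(gym.Env):
    def __init__(self):
        self.action_space = spaces.Discrete(600)   # 24 place + 24*23 move + 24 remove
        self.observation_space = spaces.Box(-1, 1, shape=(7, 7), dtype=np.int8)
        self.board = np.zeros((7, 7), dtype=np.int8)
        self.awaiting_removal = False   # True => next action must remove a piece

    def reset(self):
        self.board[:] = 0
        self.awaiting_removal = False
        return self.board.copy()

    def step(self, action):
        if not self._is_legal(action):
            return self.board.copy(), -1.0, True, {}     # illegal action loses outright
        if self.awaiting_removal:
            self._remove_piece(action)
            self.awaiting_removal = False
        elif self._apply_move(action):                   # returns True if a mill closed
            self.awaiting_removal = True                 # same player acts again
        done, reward = self._check_game_over()
        return self.board.copy(), reward, done, {"awaiting_removal": self.awaiting_removal}

    # Game-logic stubs: the actual Nine Men's Morris rules go here.
    def _is_legal(self, action): return True
    def _apply_move(self, action): return False
    def _remove_piece(self, action): pass
    def _check_game_over(self): return False, 0.0
```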
    "Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers", Chan et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
Changing the observation space from real-valued quantities to visual obs
I would like to change the observation of the MPE (multiagent-particle-envs) environment. If you look at the code (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/environment.py#L69), this is how it is initialized: self.observation_space.append(spaces.Box(low=-np.inf, high=+np.inf, shape=(obs_dim,), dtype=np.float32)) Then, if you look at one of the actual envs, for example simple spread, this (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/scenarios/simple_spread.py#L100) is what the env gives the agent as an observation at each step: return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + comm) Is there a way I can add an image of the environment to this np array? I would like my agent to also receive a visual observation (one possible wrapper is sketched below). submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
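One way to do it (a sketch; it assumes the wrapped env exposes render(mode='rgb_array'), which in the linked repo appears to return a list with one frame per viewer): wrap the environment and return, per agent, a dict combining the vector observation with a rendered frame. The declared observation_space would then need to become a spaces.Dict to match.

```python
# Sketch: attach a rendered global frame to each agent's vector observation.
import numpy as np

class VisualObsWrapper:
    def __init__(self, env):
        self.env = env

    def _augment(self, obs_n):
        # Assumption: render(mode='rgb_array') yields a list of (H, W, 3) frames.
        frame = np.asarray(self.env.render(mode="rgb_array")[0])
        return [{"vec": o, "img": frame} for o in obs_n]  # shared global view

    def reset(self):
        return self._augment(self.env.reset())

    def step(self, action_n):
        obs_n, rew_n, done_n, info_n = self.env.step(action_n)
        return self._augment(obs_n), rew_n, done_n, info_n
```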
    On the Verge of Solving Rocket League using Deep Reinforcement Learning and Sim-to-sim Transfer
    Paper: https://arxiv.org/abs/2205.05061 Videos: https://www.youtube.com/watch?v=8k9FNxIU0KQ Github: Coming soon Playlist: https://www.youtube.com/watch?v=WXMHJszkz6M&list=PL2KGNY2Ei3ix7Vr_vA-ZgCyVfOCfhbX0C submitted by /u/LilHairdy [link] [comments]  ( 1 min )
    Papers explaining the limitations of Q-learning and the deep Q-network
Hi! I am currently working on my master's thesis, which involves custom-made environments for Q-learning and the deep Q-network from Mnih et al. (2015). One limitation of my custom-made environment is that the state and action spaces grow rapidly. For the Q-learning solution, the Q-table becomes enormous, with millions to billions of possible states, while the deep Q-network suffers from catastrophic forgetting. Furthermore, the replay memory (1 million experiences) is not large enough to deal efficiently with the rapid growth in state space. I know that "Implementing the Deep Q-Network" by Melrose Roderick, James MacGlashan, and Stefanie Tellex is a good paper explaining the limitations of DQN. My question is: does anyone know of more good papers like this? I am also interested in papers explaining the limitations of Q-learning. Thank you for any help! submitted by /u/Sondreeo [link] [comments]  ( 1 min )
  • Open

    Language Models Perform Reasoning via Chain of Thought
    Posted by Jason Wei and Denny Zhou, Research Scientists, Google Research, Brain team In recent years, scaling up the size of language models has been shown to be a reliable way to improve performance on a range of natural language processing (NLP) tasks. Today’s language models at the scale of 100B or more parameters achieve strong performance on tasks like sentiment analysis and machine translation, even with little or no training examples. Even the largest language models, however, can still struggle with certain multi-step reasoning tasks, such as math word problems and commonsense reasoning. How might we enable language models to perform such reasoning tasks? In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” we explore a prompting method for improving the re…  ( 6 min )
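To make the method concrete, here is the paper's running few-shot exemplar (reproduced from memory, so treat the exact wording as approximate) as a Python prompt string: the demonstration shows intermediate reasoning steps rather than only the final answer, and the model is expected to imitate that style on the new question.

```python
# Sketch: a one-shot chain-of-thought prompt in the style of the paper.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""
# A capable model should now produce a worked answer such as:
# "The cafeteria had 23 apples. They used 20, leaving 3. They bought 6 more,
#  so 3 + 6 = 9. The answer is 9."
```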
    Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate
    Posted by Isaac Caswell and Ankur Bapna, Research Scientists, Google Translate Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT have soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people world wide, they only include around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas. There are two key bottlenecks t…  ( 9 min )
  • Open

    Achieve in-vehicle comfort using personalized machine learning and Amazon SageMaker
    This blog post is co-written by Rudra Hota and Esaias Pech from Continental AG. Many drivers have had the experience of trying to adjust temperature settings in their vehicle while attempting to keep their eyes on the road. Whether the previous driver preferred a warmer cabin temperature, or you’re now wearing warmer clothing, or the […]  ( 8 min )
  • Open

    New Twitter account: ElementFact
    I started a new Twitter account this morning: @ElementFact. I’m thinking the account will post things like scientific facts about each element but also some history around how the element was discovered and named and other lore associated with the element. We’ll see how this goes. I’ve started many Twitter accounts over the years. Some […] New Twitter account: ElementFact first appeared on John D. Cook.  ( 1 min )
  • Open

    Transfer Learning — Part — 6.1!! Implementing Mobilenet in Keras
In Part 6.0 of the Transfer Learning series we discussed the MobileNet pre-trained model in depth, so in this part we will…  ( 89 min )
  • Open

    DSC Weekly Newsletter 10 May 2022: Data Meshes, Digital Twins, and Knowledge Graphs
    There’s an underlying theme to this week’s articles, which is a curious occurrence given that so much of our content is user-driven. That theme is the Value of Data. There is a tendency when looking at data in its various incarnations to view all data as somehow being valuable. Realistically, without rolling up sleeves and… Read More »DSC Weekly Newsletter 10 May 2022: Data Meshes, Digital Twins, and Knowledge Graphs The post DSC Weekly Newsletter 10 May 2022: Data Meshes, Digital Twins, and Knowledge Graphs appeared first on Data Science Central.  ( 3 min )
  • Open

    Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System. (arXiv:2201.12604v2 [cs.LG] UPDATED)
    Humans excel at continually learning from an ever-changing environment whereas it remains a challenge for deep neural networks which exhibit catastrophic forgetting. The complementary learning system (CLS) theory suggests that the interplay between rapid instance-based learning and slow structured learning in the brain is crucial for accumulating and retaining knowledge. Here, we propose CLS-ER, a novel dual memory experience replay (ER) method which maintains short-term and long-term semantic memories that interact with the episodic memory. Our method employs an effective replay mechanism whereby new knowledge is acquired while aligning the decision boundaries with the semantic memories. CLS-ER does not utilize the task boundaries or make any assumption about the distribution of the data which makes it versatile and suited for "general continual learning". Our approach achieves state-of-the-art performance on standard benchmarks as well as more realistic general continual learning settings.  ( 2 min )
    Application of Transfer Learning and Ensemble Learning in Image-level Classification for Breast Histopathology. (arXiv:2204.08311v2 [cs.CV] UPDATED)
Background: Breast cancer has the highest prevalence among cancers in women globally. The classification and diagnosis of breast cancer and its histopathological images have always been a hot spot of clinical concern. In Computer-Aided Diagnosis (CAD), traditional classification models mostly use a single network to extract features, which has significant limitations. On the other hand, many networks are trained and optimized on patient-level datasets, ignoring the application of lower-level data labels. Method: This paper proposes a deep ensemble model based on image-level labels for the binary classification of benign and malignant lesions in breast histopathological images. First, the BreaKHis dataset is randomly divided into training, validation and test sets. Then, data augmentation techniques are used to balance the numbers of benign and malignant samples. Thirdly, considering the performance of transfer learning and the complementarity between networks, VGG16, Xception, ResNet50 and DenseNet201 are selected as the base classifiers. Result: In the ensemble network model with accuracy as the weight, the image-level binary classification achieves an accuracy of $98.90\%$. To verify the capabilities of our method, the latest Transformer and Multilayer Perceptron (MLP) models have been compared experimentally on the same dataset. Our model outperforms them by a $5\%-20\%$ margin, underscoring the ensemble model's significance for classification tasks. Conclusion: This research focuses on improving the model's classification performance with an ensemble algorithm. Transfer learning plays an essential role on small datasets, improving training speed and accuracy. Our model has outperformed many existing approaches in accuracy, providing a method for the field of auxiliary medical diagnosis.  ( 2 min )
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v2 [stat.ML] UPDATED)
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which is possibly different from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.  ( 2 min )
    A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. (arXiv:2205.05040v1 [cs.LG])
In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validates our theory.  ( 2 min )
    NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality. (arXiv:2205.04421v2 [eess.AS] UPDATED)
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define/judge such quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in VAE. Experiment evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) to human recordings at the sentence level, with Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.  ( 2 min )
    Engineering flexible machine learning systems by traversing functionally invariant paths in weight space. (arXiv:2205.00334v2 [cs.LG] UPDATED)
    Deep neural networks achieve human-like performance on a variety of perceptual and decision making tasks. However, deep networks perform poorly when confronted with changing tasks or goals, and broadly fail to match the flexibility and robustness of human intelligence. Here, we develop a mathematical and algorithmic framework that enables continual training of deep neural networks on a broad range of objectives by defining path connected sets of neural networks that achieve equivalent functional performance on a given machine learning task while modulating network weights to achieve high-performance on a secondary objective. We view the weight space of a neural network as a curved Riemannian manifold and move a neural network along a functionally invariant path in weight space while searching for networks that satisfy a secondary objective. We introduce a path-sampling algorithm that trains networks with millions of weight parameters to learn a series of image classification tasks without performance loss. The algorithm generalizes to accommodate a range of secondary objectives including weight-pruning and weight diversification and exhibits state of the art performance on network compression and adversarial robustness benchmarks. Broadly, we demonstrate how the intrinsic geometry of machine learning problems can be harnessed to construct flexible and robust neural networks.  ( 2 min )
    Deep Learning Based Cloud Cover Parameterization for ICON. (arXiv:2112.11317v2 [physics.ao-ph] UPDATED)
    A promising approach to improve cloud parameterizations within climate models and thus climate projections is to use deep learning in combination with training data from storm-resolving model (SRM) simulations. The ICOsahedral Non-hydrostatic (ICON) modeling framework permits simulations ranging from numerical weather prediction to climate projections, making it an ideal target to develop neural network (NN) based parameterizations for sub-grid scale processes. Within the ICON framework, we train NN based cloud cover parameterizations with coarse-grained data based on realistic regional and global ICON SRM simulations. We set up three different types of NNs that differ in the degree of vertical locality they assume for diagnosing cloud cover from coarse-grained atmospheric state variables. The NNs accurately estimate sub-grid scale cloud cover from coarse-grained data that has similar geographical characteristics as their training data. Additionally, globally trained NNs can reproduce sub-grid scale cloud cover of the regional SRM simulation. Using the game-theory based interpretability library SHapley Additive exPlanations, we identify an overemphasis on specific humidity and cloud ice as the reason why our column-based NN cannot perfectly generalize from the global to the regional coarse-grained SRM data. The interpretability tool also helps visualize similarities and differences in feature importance between regionally and globally trained column-based NNs, and reveals a local relationship between their cloud cover predictions and the thermodynamic environment. Our results show the potential of deep learning to derive accurate yet interpretable cloud cover parameterizations from global SRMs, and suggest that neighborhood-based models may be a good compromise between accuracy and generalizability.  ( 2 min )
    SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures. (arXiv:2205.04711v1 [cs.AR])
    Graph neural networks (GNNs) can extract features by learning both the representation of each objects (i.e., graph nodes) and the relationship across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite its strengths, utilizing these algorithms in a production environment faces several challenges as the number of graph nodes and edges amount to several billions to hundreds of billions scale, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model which significantly hampers the productivity of ML practitioners as it mandates the overall working set to fit within DRAM capacity. In this work, we first conduct a detailed characterization on a state-of-the-art, large-scale GNN training algorithm, GraphSAGE. Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVM SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size. Given the large performance gap between DRAM and SSD, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss. We therefore develop SmartSAGE, our software/hardware co-design based on an in-storage processing (ISP) architecture. Our work demonstrates that an ISP based large-scale GNN training system can achieve both high capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.  ( 2 min )
    An Algorithmic Framework for Bias Bounties. (arXiv:2201.10408v4 [cs.LG] UPDATED)
    We propose and analyze an algorithmic framework for "bias bounties": events in which external participants are invited to propose improvements to a trained model, akin to bug bounty events in software and security. Our framework allows participants to submit arbitrary subgroup improvements, which are then algorithmically incorporated into an updated model. Our algorithm has the property that there is no tension between overall and subgroup accuracies, nor between different subgroup accuracies, and it enjoys provable convergence to either the Bayes optimal model or a state in which no further improvements can be found by the participants. We provide formal analyses of our framework, experimental evaluation, and findings from a preliminary bias bounty event.  ( 2 min )
    RLFlow: Optimising Neural Network Subgraph Transformation with World Models. (arXiv:2205.01435v2 [cs.LG] UPDATED)
Training deep learning models takes an extremely long time and consumes large amounts of computing resources. At the same time, recent research has proposed systems and compilers that are expected to decrease deep learning models' runtime. An effective optimisation methodology in data processing is desirable, and the reduction of the compute requirements of deep learning models is the focus of extensive research. In this paper, we address neural network sub-graph transformation by exploring reinforcement learning (RL) agents to achieve performance improvement. Our proposed approach RLFlow can learn to perform neural network subgraph transformations, without the need for expertly designed heuristics, to achieve a high level of performance. Recent work has aimed at applying RL to computer systems with some success, especially using model-free RL techniques. Model-based reinforcement learning methods have seen an increased focus in research as they can be used to learn the transition dynamics of the environment; this can be leveraged to train an agent using a hallucinogenic environment such as a World Model (WM), thereby increasing sample efficiency compared to model-free approaches. WM uses variational auto-encoders to build a model of the system, which allows exploring the model in an inexpensive way. In RLFlow, we propose a design for a model-based agent with WM which learns to optimise the architecture of neural networks by performing a sequence of sub-graph transformations to reduce model runtime. We show that our approach can match the state-of-the-art performance on common convolutional networks and outperforms those based on transformer-style architectures by up to 5%.  ( 2 min )
    Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards. (arXiv:2205.04702v1 [cs.AR])
    Personalized recommendation models (RecSys) are one of the most popular machine learning workload serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage so current systems employ a hybrid CPU-GPU design to have the large CPU memory store the memory hungry embedding layers. Unfortunately, training embeddings involve several memory bandwidth intensive operations which is at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache design. In this work, we present a fundamentally different approach in designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that not only sees the past but also the "future" cache accesses. ScratchPipe exploits such property to guarantee that the active working set of embedding layers can "always" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.  ( 2 min )
    Cognitive Visual-learning Environment for PostgreSQL. (arXiv:2205.04834v1 [cs.LG])
PostgreSQL is an advanced, SQL-compliant, open-source object-relational database management system (ORDBMS) that has been widely adopted by the database community for a variety of information extraction use cases. However, many users have not yet settled on PostgreSQL because its purely textual environment remains complex and opaque to an amateur user. Hence, there is a need for an easy environment in which users can comprehend the procedures and standards by which databases are created, tables and the relationships among them are defined, and queries are built and their flow controlled by conditions in PostgreSQL. As such, this project identifies the dominant features offered by PostgreSQL, analyzes the constraints that keep the database user community from migrating to PostgreSQL and, based on the scope and constraints identified, develops a system that serves both as a query generation platform and as a learning tool providing an interactive environment for cognitively learning PostgreSQL query building. This is achieved using a visual editor that incorporates a textual editor for the well-versed user. By providing visually draggable query components to work with, this research aims to offer a cognitive, visual and tactile environment where users can interactively learn PostgreSQL query generation.  ( 2 min )
    Learning Relative Return Policies With Upside-Down Reinforcement Learning. (arXiv:2202.12742v2 [cs.LG] UPDATED)
    Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method -- upside-down reinforcement learning -- to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.  ( 2 min )
    Federated Random Reshuffling with Compression and Variance Reduction. (arXiv:2205.03914v2 [cs.LG] UPDATED)
Random Reshuffling (RR), which is a variant of Stochastic Gradient Descent (SGD) employing sampling without replacement, is an immensely popular method for training supervised machine learning models via empirical risk minimization. Due to its superior practical performance, it is embedded and often set as default in standard machine learning software. Under the name FedRR, this method was recently shown to be applicable to federated learning (Mishchenko et al., 2021), with superior performance when compared to common baselines such as Local SGD. Inspired by this development, we design three new algorithms to improve FedRR further: compressed FedRR and two variance reduced extensions: one for taming the variance coming from shuffling and the other for taming the variance due to compression. The variance reduction mechanism for compression allows us to eliminate dependence on the compression parameter, and applying additional controlled linear perturbations for Random Reshuffling, introduced by Malinovsky et al. (2021), helps to eliminate variance at the optimum. We provide the first analysis of compressed local methods under standard assumptions without bounded gradient assumptions and for heterogeneous data, overcoming the limitations of the compression operator. We corroborate our theoretical results with experiments on synthetic and real data sets.
    Automatic Sleep Staging of EEG Signals: Recent Development, Challenges, and Future Directions. (arXiv:2111.08446v3 [eess.SP] UPDATED)
    Modern deep learning holds a great potential to transform clinical practice on human sleep. Teaching a machine to carry out routine tasks would be a tremendous reduction in workload for clinicians. Sleep staging, a fundamental step in sleep practice, is a suitable task for this and will be the focus in this article. Recently, automatic sleep staging systems have been trained to mimic manual scoring, leading to similar performance to human sleep experts, at least on scoring of healthy subjects. Despite tremendous progress, we have not seen automatic sleep scoring adopted widely in clinical environments. This review aims to give a shared view of the authors on the most recent state-of-the-art development in automatic sleep staging, the challenges that still need to be addressed, and the future directions for automatic sleep scoring to achieve clinical value.
    Semantic features of object concepts generated with GPT-3. (arXiv:2202.03753v2 [cs.CL] UPDATED)
    Semantic features have been playing a central role in investigating the nature of our conceptual representations. Yet the enormous time and effort required to empirically sample and norm features from human raters has restricted their use to a limited set of manually curated concepts. Given recent promising developments with transformer-based language models, here we asked whether it was possible to use such models to automatically generate meaningful lists of properties for arbitrary object concepts and whether these models would produce features similar to those found in humans. To this end, we probed a GPT-3 model to generate semantic features for 1,854 objects and compared automatically-generated features to existing human feature norms. GPT-3 generated many more features than humans, yet showed a similar distribution in the types of generated features. Generated feature norms rivaled human norms in predicting similarity, relatedness, and category membership, while variance partitioning demonstrated that these predictions were driven by similar variance in humans and GPT-3. Together, these results highlight the potential of large language models to capture important facets of human knowledge and yield a new approach for automatically generating interpretable feature sets, thus drastically expanding the potential use of semantic features in psychological and linguistic studies.
    Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization. (arXiv:2111.07898v2 [cs.CV] UPDATED)
    Recurrent neural networks (RNNs) have been shown to perform better than feedforward architectures in visual object categorization tasks, especially in challenging conditions such as cluttered images. However, little is known about the exact computational role of recurrent information flow in these conditions. Here we test RNNs trained for object categorization on the hypothesis that recurrence iteratively aids object categorization via the communication of category-orthogonal auxiliary variables (the location, orientation, and scale of the object). Using diagnostic linear readouts, we find that: (a) information about auxiliary variables increases across time in all network layers, (b) this information is indeed present in the recurrent information flow, and (c) its manipulation significantly affects task performance. These observations confirm the hypothesis that category-orthogonal auxiliary variable information is conveyed through recurrent connectivity and is used to optimize category inference in cluttered environments.
    Tight Last-Iterate Convergence of the Extragradient and the Optimistic Gradient Descent-Ascent Algorithm for Constrained Monotone Variational Inequalities. (arXiv:2204.09228v2 [math.OC] UPDATED)
    The monotone variational inequality is a central problem in mathematical programming that unifies and generalizes many important settings such as smooth convex optimization, two-player zero-sum games, convex-concave saddle point problems, etc. The extragradient algorithm by Korpelevich [1976] and the optimistic gradient descent-ascent algorithm by Popov [1980] are arguably the two most classical and popular methods for solving monotone variational inequalities. Despite its long history, the following major problem remains open. What is the last-iterate convergence rate of the extragradient algorithm or the optimistic gradient descent-ascent algorithm for monotone and Lipschitz variational inequalities with constraints? We resolve this open problem by showing that both the extragradient algorithm and the optimistic gradient descent-ascent algorithm have a tight $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence rate for arbitrary convex feasible sets, which matches the lower bound by Golowich et al. [2020a, b]. Our rate is measured in terms of the standard gap function. At the core of our results lies a new performance measure -- the tangent residual, which can be viewed as an adaptation of the norm of the operator that takes the local constraints into account. We use the tangent residual (or a slight variation of the tangent residual) as the performance measure in our analysis of the extragradient algorithm (or the optimistic gradient descent-ascent algorithm). To establish the monotonicity of these performance measures, we develop a new approach that combines the power of the sum-of-squares programming with the low dimensionality of the update rule of the extragradient or the optimistic gradient descent-ascent algorithm. We believe our approach has many additional applications in the analysis of iterative methods.
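For intuition, a toy numpy sketch of the extragradient update on the unconstrained bilinear saddle problem min_x max_y xy, where simultaneous gradient descent-ascent famously spirals outward but EG's last iterate converges (this only illustrates the classical method; the constrained setting analyzed in the paper adds a projection after each step):

```python
# Sketch: extragradient (EG) on min_x max_y x*y, whose monotone operator is
# F(x, y) = (y, -x) with the unique saddle point at the origin.
import numpy as np

def F(z):
    x, y = z
    return np.array([y, -x])

z, eta = np.array([1.0, 1.0]), 0.1
for _ in range(1000):
    z_half = z - eta * F(z)      # extrapolation ("look-ahead") step
    z = z - eta * F(z_half)      # real step, using the look-ahead gradient
    # (OGDA is the cousin that reuses the previous gradient instead of
    #  evaluating F a second time per iteration.)
print(np.linalg.norm(z))         # -> ~0.01: the last iterate contracts to the saddle
```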
    Labeled sample compression schemes for complexes of oriented matroids. (arXiv:2110.15168v2 [math.CO] UPDATED)
    We show that the topes of a complex of oriented matroids (abbreviated COM) of VC-dimension $d$ admit a proper labeled sample compression scheme of size $d$. This considerably extends results of Moran and Warmuth on ample classes, of Ben-David and Litman on affine arrangements of hyperplanes, and of the authors on complexes of uniform oriented matroids, and is a step towards the sample compression conjecture -- one of the oldest open problems in computational learning theory. On the one hand, our approach exploits the rich combinatorial cell structure of COMs via oriented matroid theory. On the other hand, viewing tope graphs of COMs as partial cubes creates a fruitful link to metric graph theory.
    Memory-Efficient Convex Optimization for Self-Dictionary Separable Nonnegative Matrix Factorization: A Frank-Wolfe Approach. (arXiv:2109.11135v2 [eess.SP] UPDATED)
    Nonnegative matrix factorization (NMF) often relies on the separability condition for tractable algorithm design. Separability-based NMF is mainly handled by two types of approaches, namely, greedy pursuit and convex programming. A notable convex NMF formulation is the so-called self-dictionary multiple measurement vectors (SD-MMV), which can work without knowing the matrix rank a priori, and is arguably more resilient to error propagation relative to greedy pursuit. However, convex SD-MMV renders a large memory cost that scales quadratically with the problem size. This memory challenge has been around for a decade, and a major obstacle for applying convex SD-MMV to big data analytics. This work proposes a memory-efficient algorithm for convex SD-MMV. Our algorithm capitalizes on the special update rules of a classic algorithm from the 1950s, namely, the Frank-Wolfe (FW) algorithm. It is shown that, under reasonable conditions, the FW algorithm solves the noisy SD-MMV problem with a memory cost that grows linearly with the amount of data. To handle noisier scenarios, a smoothed group sparsity regularizer is proposed to improve robustness while maintaining the low memory footprint with guarantees. The proposed approach presents the first linear memory complexity algorithmic framework for convex SD-MMV based NMF. The method is tested over a couple of unsupervised learning tasks, i.e., text mining and community detection, to showcase its effectiveness and memory efficiency.
    Text Simplification by Tagging. (arXiv:2103.05070v1 [cs.CL] CROSS LISTED)
Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves inference speeds over 11 times faster than the current state-of-the-art text simplification system.
    VOS: Learning What You Don't Know by Virtual Outlier Synthesis. (arXiv:2202.01197v4 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves competitive performance on both object detection and image classification models, reducing the FPR95 by up to 9.36% compared to the previous best method on object detectors. Code is available at https://github.com/deeplearning-wisc/vos.
    Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting. (arXiv:2010.04456v6 [stat.ML] UPDATED)
    Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists in decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. This not only provides the existence and uniqueness for this decomposition, but also ensures interpretability and benefits generalization. Experiments made on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. Code is available at https://github.com/yuan-yin/APHYNITY .
    Human-robot collaboration and machine learning: a systematic review of recent research. (arXiv:2110.07448v3 [cs.RO] UPDATED)
Technological progress increasingly envisions the use of robots interacting with people in everyday life. Human-robot collaboration (HRC) is the approach that explores the interaction between a human and a robot, during the completion of a common objective, at the cognitive and physical level. In HRC works, a cognitive model is typically built, which collects inputs from the environment and from the user, and elaborates and translates these into information that can be used by the robot itself. Machine learning is a recent approach to building the cognitive model and behavioural block, with high potential in HRC. Consequently, this paper proposes a thorough literature review of the use of machine learning techniques in the context of human-robot collaboration. 45 key papers were selected and analysed, and a clustering of works based on the type of collaborative tasks, evaluation metrics and cognitive variables modelled is proposed. Then, a deep analysis of different families of machine learning algorithms and their properties, along with the sensing modalities used, is carried out. Among the observations, the importance of machine learning algorithms that incorporate time dependencies is outlined. The salient features of these works are then cross-analysed to show trends in HRC and give guidelines for future works, comparing them with other aspects of HRC not covered by the review.
    Semi-Targeted Model Poisoning Attack on Federated Learning via Backward Error Analysis. (arXiv:2203.11633v2 [cs.LG] UPDATED)
Model poisoning attacks on federated learning (FL) intrude in the entire system via compromising an edge model, resulting in malfunctioning of machine learning models. Such compromised models are tampered with to perform adversary-desired behaviors. In particular, we consider a semi-targeted situation where the source class is predetermined but the target class is not. The goal is to cause the global classifier to misclassify data of the source class. Though approaches such as label flipping have been adopted to inject poisoned parameters into FL, it has been shown that their performance is usually class-sensitive, varying with the target class applied. Typically, an attack can become less effective when shifting to a different target class. To overcome this challenge, we propose the Attacking Distance-aware Attack (ADA) to enhance a poisoning attack by finding the optimized target class in the feature space. Moreover, we studied a more challenging situation where an adversary had limited prior knowledge about a client's data. To tackle this problem, ADA deduces pair-wise distances between different classes in the latent feature space from shared model parameters based on the backward error analysis. We performed extensive empirical evaluations on ADA by varying the factor of attacking frequency in three different image classification tasks. As a result, ADA succeeded in increasing the attack performance by 1.8 times in the most challenging case with an attacking frequency of 0.01.
    Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation. (arXiv:2204.01171v2 [cs.CL] UPDATED)
    Current language generation models suffer from issues such as repetition, incoherence, and hallucinations. An often-repeated hypothesis is that this brittleness of generation models is caused by the training and the generation procedure mismatch, also referred to as exposure bias. In this paper, we verify this hypothesis by analyzing exposure bias from an imitation learning perspective. We show that exposure bias leads to an accumulation of errors, analyze why perplexity fails to capture this accumulation, and empirically show that this accumulation results in poor generation quality. Source code to reproduce these experiments is available at https://github.com/kushalarora/quantifying_exposure_bias
    Verifying Integrity of Deep Ensemble Models by Lossless Black-box Watermarking with Sensitive Samples. (arXiv:2205.04145v2 [cs.CR] UPDATED)
    With the widespread use of deep neural networks (DNNs) in many areas, more and more studies focus on protecting DNN models from intellectual property (IP) infringement. Many existing methods apply digital watermarking to protect the DNN models. The majority of them either embed a watermark directly into the internal network structure/parameters or insert a zero-bit watermark by fine-tuning a model to be protected with a set of so-called trigger samples. Though these methods work very well, they were designed for individual DNN models, which cannot be directly applied to deep ensemble models (DEMs) that combine multiple DNN models to make the final decision. It motivates us to propose a novel black-box watermarking method in this paper for DEMs, which can be used for verifying the integrity of DEMs. In the proposed method, a certain number of sensitive samples are carefully selected through mimicking real-world DEM attacks and analyzing the prediction results of the sub-models of the non-attacked DEM and the attacked DEM on the carefully crafted dataset. By analyzing the prediction results of the target DEM on these carefully crafted sensitive samples, we are able to verify the integrity of the target DEM. Different from many previous methods, the proposed method does not modify the original DEM to be protected, which indicates that the proposed method is lossless. Experimental results have shown that the DEM integrity can be reliably verified even if only one sub-model was attacked, which has good potential in practice.
    Exploring Viable Algorithmic Options for Learning from Demonstration (LfD): A Parameterized Complexity Approach. (arXiv:2205.04989v1 [cs.LG])
The key to reconciling the polynomial-time intractability of many machine learning tasks in the worst case with the surprising solvability of these tasks by heuristic algorithms in practice seems to be exploiting restrictions on real-world data sets. One approach to investigating such restrictions is to analyze why heuristics perform well under restrictions. A complementary approach would be to systematically determine under which sets of restrictions efficient and reliable machine learning algorithms do and do not exist. In this paper, we show how such a systematic exploration of algorithmic options can be done using parameterized complexity analysis. As an illustrative example, we give the first parameterized complexity analysis of batch and incremental policy inference under Learning from Demonstration (LfD). Relative to a basic model of LfD, we show that none of our problems can be solved efficiently either in general or relative to a number of (often simultaneous) restrictions on environments, demonstrations, and policies. We also give the first known restrictions under which efficient solvability is possible and discuss the implications of our solvability and unsolvability results for both our basic model of LfD and more complex models of LfD used in practice.  ( 2 min )
    Representing Hierarchical Structure by Using Cone Embedding. (arXiv:2102.08014v2 [cs.AI] UPDATED)
Graph embedding is becoming an important method with applications in various areas, including social networks and knowledge graph completion. In particular, Poincar\'e embedding has been proposed to capture the hierarchical structure of graphs, and its effectiveness has been reported. However, most of the existing methods have isometric mappings in the embedding space, and the choice of the origin point can be arbitrary. This fact is not desirable when the distance from the origin is used as an indicator of hierarchy, as in the case of Poincar\'e embedding. In this paper, we propose cone embedding, an embedding method in a metric cone, which solves these problems and gains further benefits: 1) we provide an indicator of hierarchical information that is both geometrically and intuitively natural to interpret, and 2) we can extract the hierarchical structure from the graph embedding output of other methods by learning additional one-dimensional parameters.
    Entity Linking and Discovery via Arborescence-based Supervised Clustering. (arXiv:2109.01242v2 [cs.CL] UPDATED)
    Previous work has shown promising results in performing entity linking by measuring not only the affinities between mentions and entities but also those amongst mentions. In this paper, we present novel training and inference procedures that fully utilize mention-to-mention affinities by building minimum arborescences (i.e., directed spanning trees) over mentions and entities across documents in order to make linking decisions. We also show that this method gracefully extends to entity discovery, enabling the clustering of mentions that do not have an associated entity in the knowledge base. We evaluate our approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset, and show significant improvements in performance for both entity linking and discovery compared to identically parameterized models. We further show significant efficiency improvements with only a small loss in accuracy over previous work, which uses more computationally expensive models.
    Learning to Answer Visual Questions from Web Videos. (arXiv:2205.05019v1 [cs.CV])
    Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate the \webdataname{} dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce \smalldatasetname{}, a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available at https://antoyang.github.io/just-ask.html
    Fundamental limitations on optimization in variational quantum algorithms. (arXiv:2205.05056v1 [quant-ph])
    Exploring quantum applications of near-term quantum devices is a rapidly growing field of quantum information science with both theoretical and practical interest. A leading paradigm for establishing such near-term quantum applications is variational quantum algorithms (VQAs). These algorithms use a classical optimizer to train a parameterized quantum circuit to accomplish certain tasks, where the circuits are usually randomly initialized. In this work, we prove that for a broad class of such random circuits, the variation range of the cost function obtained by adjusting any local quantum gate within the circuit vanishes exponentially in the number of qubits with high probability. This result unifies the restrictions on gradient-based and gradient-free optimizations in a natural manner and reveals additional harsh constraints on the training landscapes of VQAs. Hence, a fundamental limitation on the trainability of VQAs is revealed, pinpointing the source of the optimization hardness in the exponentially large Hilbert space. We further showcase the validity of our results with numerical simulations of representative VQAs. We believe that these results will deepen our understanding of the scalability of VQAs and shed light on the search for near-term quantum applications with advantages.
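    As a quick numerical illustration of the concentration phenomenon the paper proves, the sketch below samples Haar-random states as an informal stand-in for the outputs of deep random circuits and shows that the variance of a local observable's expectation shrinks roughly as $2^{-n}$. This is a proxy experiment, not the paper's construction.

```python
import numpy as np

def haar_state(n_qubits, rng):
    """Haar-random pure state via a normalized complex Gaussian vector."""
    d = 2 ** n_qubits
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def z_expectation(state):
    """<Z> on the first qubit: +1 for basis states whose leading bit is 0."""
    probs = np.abs(state) ** 2
    signs = np.where(np.arange(len(probs)) < len(probs) // 2, 1.0, -1.0)
    return float(np.sum(signs * probs))

rng = np.random.default_rng(0)
for n in range(2, 12, 2):
    vals = [z_expectation(haar_state(n, rng)) for _ in range(2000)]
    print(n, np.var(vals))  # variance shrinks roughly as 2**-n
```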
    Adaptive Ranking Based Constraint Handling for Explicitly Constrained Black-Box Optimization. (arXiv:1811.00764v3 [cs.NE] UPDATED)
    We propose a novel constraint-handling technique for the covariance matrix adaptation evolution strategy (CMA-ES). The proposed technique is aimed at solving explicitly constrained black-box continuous optimization problems, in which an explicit constraint is a constraint whereby the computational cost of evaluating the constraint violation and its (numerical) gradient is negligible compared to that of the objective function. This method is designed to realize two invariance properties: invariance to affine transformations of the search space, and invariance to increasing transformations of the objective and constraint functions. The CMA-ES is designed to possess these properties for handling difficulties that appear in black-box optimization problems, such as non-separability, ill-conditioning, ruggedness, and the different orders of magnitude in the objective. The proposed constraint-handling technique (CHT), known as ARCH, modifies the underlying CMA-ES only in terms of the ranking of the candidate solutions. It employs a repair operator and an adaptive ranking aggregation strategy to compute the ranking. We developed test problems to evaluate the effects of the invariance properties, and performed experiments to empirically verify the invariance of the algorithm. We compared the proposed method with other CHTs on the CEC 2006 constrained optimization benchmark suite to demonstrate its efficacy. Empirical studies reveal that ARCH is able to exploit the explicitness of the constraint functions effectively, sometimes even more efficiently than an existing box-constraint handling technique on box-constrained problems, while exhibiting the invariance properties. Moreover, in terms of the number of objective function calls, ARCH overwhelmingly outperforms CHTs that do not exploit the explicit constraints.
    Hybrid Far- and Near-Field Channel Estimation for THz Ultra-Massive MIMO via Fixed Point Networks. (arXiv:2205.04944v1 [eess.SP])
    Terahertz ultra-massive multiple-input multiple-output (THz UM-MIMO) is envisioned as one of the key enablers of 6G wireless systems. Due to the joint effect of its large array aperture and small wavelength, the near-field region of THz UM-MIMO systems is greatly enlarged. The high-dimensional channel of such systems thus consists of a stochastic mixture of far and near fields, which renders channel estimation extremely challenging. Previous works based on uni-field assumptions cannot capture the hybrid far- and near-field features, and will suffer significant performance loss. This motivates us to consider hybrid-field channel estimation. We draw inspiration from fixed point theory to develop an efficient deep learning based channel estimator with adaptive complexity and a linear convergence guarantee. Built upon classic orthogonal approximate message passing, we transform each iteration into a contractive mapping, comprising a closed-form linear estimator and a neural network based non-linear estimator. A major algorithmic innovation involves applying fixed point iteration to compute the channel estimate while modeling neural networks with arbitrary depth and adapting to the hybrid-field channel conditions. Simulation results verify our theoretical analysis and show significant gains over state-of-the-art approaches in estimation accuracy and convergence rate.
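    Since the estimator is built around a contractive mapping, a plain fixed-point iteration already captures its computational skeleton. The sketch below is generic: the toy linear map and soft-threshold "denoiser" are illustrative stand-ins, not the paper's OAMP-based modules.

```python
import numpy as np

def fixed_point_iterate(f, x0, tol=1e-6, max_iter=500):
    """Generic fixed-point solver: iterate x <- f(x) until convergence.
    For a contractive f, Banach's theorem guarantees a unique fixed point
    and linear convergence, mirroring the guarantee described above."""
    x = x0
    for k in range(max_iter):
        x_next = f(x)
        if np.linalg.norm(x_next - x) < tol * (1 + np.linalg.norm(x)):
            return x_next, k + 1
        x = x_next
    return x, max_iter

# toy contraction: a linear estimator followed by a soft-threshold "denoiser"
A = np.array([[0.3, 0.1], [0.0, 0.2]])   # ||A|| < 1, so the map contracts
b = np.array([1.0, -0.5])
soft = lambda v, t=0.05: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x_hat, iters = fixed_point_iterate(lambda x: soft(A @ x + b), np.zeros(2))
print(x_hat, iters)
```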
    Impact of L1 Batch Normalization on Analog Noise Resistant Property of Deep Learning Models. (arXiv:2205.04886v1 [cs.LG])
    Analog hardware has recently become a popular choice for machine learning on resource-constrained devices due to its fast execution and energy efficiency. However, the inherent presence of noise in analog hardware and its negative impact on deployed deep neural network (DNN) models limit their usage. The degradation in performance due to noise calls for novel DNN designs with excellent noise-resistant properties, leveraging the properties of the fundamental building blocks of DNN models. In this work, the use of the L1 or TopK BatchNorm type, a fundamental DNN building block, for designing DNN models with excellent noise-resistant properties is proposed. Specifically, a systematic study is carried out by training DNN models with the L1/TopK BatchNorm type and comparing their performance with that of DNN models using the L2 BatchNorm type. The noise-resistant property of the resulting models is tested by injecting additive noise into the model weights and evaluating the inference accuracy under noise. The results show that the L1 and TopK BatchNorm types have excellent noise-resistant properties, and that there is no sacrifice in performance when changing the BatchNorm type from L2 to L1/TopK.
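    For concreteness, here is a minimal training-mode sketch of an L1 batch-norm layer, assuming the variant meant is normalization by the per-channel mean absolute deviation; running statistics and the optional sqrt(pi/2) Gaussian-matching factor are omitted for brevity.

```python
import torch
import torch.nn as nn

class L1BatchNorm2d(nn.Module):
    """Sketch of an L1 batch norm (training mode only): normalize each
    channel by its mean absolute deviation instead of the standard
    deviation. The weight-noise test from the abstract amounts to
    `p.data += noise_std * torch.randn_like(p)` on model parameters."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.eps = eps

    def forward(self, x):                       # x: (N, C, H, W)
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        centered = x - mean
        mad = centered.abs().mean(dim=(0, 2, 3), keepdim=True)  # L1 statistic
        x_hat = centered / (mad + self.eps)
        return x_hat * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```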
    On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning. (arXiv:2105.01648v4 [cs.LG] UPDATED)
    The lottery ticket hypothesis questions the role of overparameterization in supervised deep learning. But how is the performance of winning lottery tickets affected by the distributional shift inherent to reinforcement learning problems? In this work, we investigate this question by comparing sparse agents that have to cope with the non-stationarity of the exploration-exploitation problem against supervised agents trained to imitate an expert. We show that, compared to reinforcement learning, feed-forward networks trained with behavioural cloning can be pruned to higher levels of sparsity without performance degradation. This suggests that, in order to handle the RL-specific distributional shift, agents require more degrees of freedom. Using a set of carefully designed baseline conditions, we find that the majority of the lottery ticket effect in both learning paradigms can be attributed to the identified mask rather than the weight initialization. The input layer mask selectively prunes entire input dimensions that turn out to be irrelevant for the task at hand. At a moderate level of sparsity the mask identified by iterative magnitude pruning yields minimal task-relevant representations, i.e., an interpretable inductive bias. Finally, we propose a simple initialization rescaling which promotes the robust identification of sparse task representations in low-dimensional control tasks.
    Text-Free Prosody-Aware Generative Spoken Language Modeling. (arXiv:2109.03264v2 [cs.CL] UPDATED)
    Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm. Codes and models are available at https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/pgslm.
    Adaptive Graph Convolutional Network Framework for Multidimensional Time Series Prediction. (arXiv:2205.04885v1 [cs.LG])
    In the real world, long sequence time-series forecasting (LSTF) is needed in many cases, such as power consumption prediction and air quality prediction. Multi-dimensional long time series impose stricter requirements on the model: it must not only effectively capture the accurate long-term dependence between input and output, but also capture the relationships between data of different dimensions. Recent research shows that the Transformer-based Informer model achieves excellent performance in long time series prediction. However, this model still has deficiencies in multidimensional prediction: it cannot capture the relationships between different dimensions well. We improve Informer to address its shortcomings in multidimensional forecasting. First, we introduce an adaptive graph neural network to capture hidden dimension dependencies in time series prediction. Second, we integrate adaptive graph convolutional networks into various spatio-temporal series prediction models to remedy their inability to capture the relationships between different dimensions. Third, experimental testing on multiple datasets shows that model accuracy improves by about 10\% when our framework is introduced.
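    The abstract does not give the exact formulation, so here is a minimal sketch of the adaptive-adjacency idea in the spirit of Graph WaveNet: the graph over series dimensions is learned from node embeddings rather than supplied a priori. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Minimal adaptive graph convolution: the adjacency over series
    dimensions is learned from two node-embedding matrices, so no graph
    needs to be given a priori (an assumption about the paper's module)."""
    def __init__(self, num_nodes, in_dim, out_dim, emb_dim=10):
        super().__init__()
        self.e1 = nn.Parameter(torch.randn(num_nodes, emb_dim))
        self.e2 = nn.Parameter(torch.randn(emb_dim, num_nodes))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                        # x: (batch, num_nodes, in_dim)
        adj = torch.softmax(torch.relu(self.e1 @ self.e2), dim=-1)
        return self.lin(adj @ x)                 # mix information across dimensions

x = torch.randn(8, 20, 16)                       # 8 samples, 20 series, 16 features
print(AdaptiveGraphConv(20, 16, 32)(x).shape)    # torch.Size([8, 20, 32])
```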
    An overview of artificial intelligence techniques for diagnosis of Schizophrenia based on magnetic resonance imaging modalities: Methods, challenges, and future works. (arXiv:2103.03081v3 [cs.LG] UPDATED)
    Schizophrenia (SZ) is a mental disorder that typically emerges in late adolescence or early adulthood. It reduces the life expectancy of patients by 15 years. Abnormal behavior, perception of emotions, social relationships, and reality perception are among its most significant symptoms. Past studies have revealed that SZ affects the temporal and anterior lobes of the hippocampus regions of the brain. Also, increased volume of cerebrospinal fluid (CSF) and decreased volume of white and gray matter can be observed due to this disease. Magnetic resonance imaging (MRI) is the popular neuroimaging technique used to explore structural/functional brain abnormalities in SZ disorder, owing to its high spatial resolution. Various artificial intelligence (AI) techniques have been employed with advanced image/signal processing methods to accurately diagnose SZ. This paper presents a comprehensive overview of studies conducted on the automated diagnosis of SZ using MRI modalities. First, an AI-based computer-aided diagnosis system (CADS) for SZ diagnosis and its relevant sections are presented. Then, the most important conventional machine learning (ML) and deep learning (DL) techniques used in the diagnosis of SZ are introduced. A comprehensive comparison between ML and DL studies is also made in the discussion section. Next, the most important challenges in diagnosing SZ are addressed. Future work on diagnosing SZ using AI techniques and MRI modalities is recommended in a separate section. Results, conclusions, and research findings are presented at the end.
    On learning agent-based models from data. (arXiv:2205.05052v1 [physics.soc-ph])
    Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, ABMs typically cannot estimate agent-specific (or "micro") variables, a major limitation that prevents ABMs from harnessing micro-level data availability and greatly limits their predictive power. In this paper, we propose a protocol to learn the latent micro-variables of an ABM from data. The first step of our protocol is to reduce an ABM to a probabilistic model, characterized by a computationally tractable likelihood. This reduction follows two general design principles: balance of stochasticity and data availability, and replacement of unobservable discrete choices with differentiable approximations. Then, our protocol proceeds by maximizing the likelihood of the latent variables via a gradient-based expectation maximization algorithm. We demonstrate our protocol by applying it to an ABM of the housing market, in which agents with different incomes bid higher prices to live in high-income neighborhoods. We demonstrate that the obtained model allows accurate estimates of the latent variables, while preserving the general behavior of the ABM. We also show that our estimates can be used for out-of-sample forecasting. Our protocol can be seen as an alternative to black-box data assimilation methods, one that forces the modeler to lay bare the assumptions of the model, to think about the inferential process, and to spot potential identification problems.
    Secure Distributed/Federated Learning: Prediction-Privacy Trade-Off for Multi-Agent System. (arXiv:2205.04855v1 [cs.MA])
    Decentralized learning is an efficient emerging paradigm for boosting the computing capability of multiple bounded computing agents. In the big data era, performing inference within the distributed and federated learning (DL and FL) frameworks, the central server needs to process a large amount of data while relying on various agents to perform multiple distributed training tasks. Considering the decentralized computing topology, privacy has become a first-class concern. Moreover, assuming limited information processing capability for the agents calls for a sophisticated \textit{privacy-preserving decentralization} that ensures efficient computation. Towards this end, we study the \textit{privacy-aware server to multi-agent assignment} problem subject to information processing constraints associated with each agent, while maintaining privacy and ensuring that the messages agents receive about a global terminal through the distributed private federated learning (DPFL) approach remain informative. To find a decentralized scheme for a two-agent system, we formulate an optimization problem that balances privacy and accuracy, taking into account the quality of compression constraints associated with each agent. We propose an iterative converging algorithm by alternating over self-consistent equations. We also numerically evaluate the proposed solution to show the privacy-prediction trade-off and demonstrate the efficacy of the novel approach in ensuring privacy in DL and FL.
    On the Verge of Solving Rocket League using Deep Reinforcement Learning and Sim-to-sim Transfer. (arXiv:2205.05061v1 [cs.LG])
    Autonomously trained agents that are supposed to play video games reasonably well rely either on fast simulation speeds or heavy parallelization across thousands of machines running concurrently. This work explores a third way that is established in robotics, namely sim-to-real transfer, or, if the game is considered a simulation itself, sim-to-sim transfer. In the case of Rocket League, we demonstrate that single behaviors of goalies and strikers can be successfully learned using Deep Reinforcement Learning in the simulation environment and transferred back to the original game. Although the implemented training simulation is to some extent inaccurate, the goalkeeping agent saves nearly 100% of the shots it faces once transferred, while the striking agent scores in about 75% of cases. Therefore, the trained agent is robust enough to generalize to the target domain of Rocket League.
    Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. (arXiv:2106.02073v4 [cs.LG] UPDATED)
    The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide PyTorch code for reproducing MSE-NC and CE-NC in a Google Colab notebook at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC, which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.
    Concave Utility Reinforcement Learning with Zero-Constraint Violations. (arXiv:2109.05439v2 [cs.LG] UPDATED)
    We consider the problem of tabular infinite horizon concave utility reinforcement learning (CURL) with convex constraints. Various learning applications with constraints, such as robotics, do not allow for policies that can violate constraints. To this end, we propose a model-based learning algorithm that achieves zero constraint violations. To obtain this result, we assume that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures. We then solve a tighter optimization problem to ensure that the constraints are never violated despite the imprecise model knowledge and model stochasticity. We also propose a novel Bellman error based analysis for tabular infinite-horizon setups which allows the analysis of stochastic policies. Combining the Bellman error based analysis and the tighter optimization equation, for $T$ interactions with the environment, we obtain a regret guarantee for the objective that grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors.
    Search-Based Testing of Reinforcement Learning. (arXiv:2205.04887v1 [cs.LG])
    Evaluation of deep reinforcement learning (RL) is inherently challenging. Especially the opaqueness of learned policies and the stochastic nature of both agents and environments make testing the behavior of deep RL agents difficult. We present a search-based testing framework that enables a wide range of novel analysis capabilities for evaluating the safety and performance of deep RL agents. For safety testing, our framework utilizes a search algorithm that searches for a reference trace that solves the RL task. The backtracking states of the search, called boundary states, pose safety-critical situations. We create safety test-suites that evaluate how well the RL agent escapes safety-critical situations near these boundary states. For robust performance testing, we create a diverse set of traces via fuzz testing. These fuzz traces are used to bring the agent into a wide variety of potentially unknown states from which the average performance of the agent is compared to the average performance of the fuzz traces. We apply our search-based testing approach to RL for Nintendo's Super Mario Bros.
    Domain Generalization: A Survey. (arXiv:2103.02503v5 [cs.LG] UPDATED)
    Generalization to out-of-distribution (OOD) data is a capability natural to humans yet challenging for machines to reproduce. This is because most learning algorithms strongly rely on the i.i.d.~assumption on source/target data, which is often violated in practice due to domain shift. Domain generalization (DG) aims to achieve OOD generalization by using only source data for model learning. Over the last ten years, research in DG has made great progress, leading to a broad spectrum of methodologies, e.g., those based on domain alignment, meta-learning, data augmentation, or ensemble learning, to name a few; DG has also been studied in various application areas including computer vision, speech recognition, natural language processing, medical imaging, and reinforcement learning. In this paper, for the first time a comprehensive literature review in DG is provided to summarize the developments over the past decade. Specifically, we first cover the background by formally defining DG and relating it to other relevant fields like domain adaptation and transfer learning. Then, we conduct a thorough review into existing methods and theories. Finally, we conclude this survey with insights and discussions on future research directions.
    Online AutoML: An adaptive AutoML framework for online learning. (arXiv:2201.09750v2 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) has been used successfully in settings where the learning task is assumed to be static. In many real-world scenarios, however, the data distribution will evolve over time, and it is yet to be shown whether AutoML techniques can effectively design online pipelines in dynamic environments. This study aims to automate pipeline design for online learning while continuously adapting to data drift. For this purpose, we design an adaptive Online Automated Machine Learning (OAML) system, searching the complete pipeline configuration space of online learners, including preprocessing algorithms and ensembling techniques. This system combines the inherent adaptation capabilities of online learners with the fast automated pipeline (re)optimization capabilities of AutoML. Focusing on optimization techniques that can adapt to evolving objectives, we evaluate asynchronous genetic programming and asynchronous successive halving to optimize these pipelines continually. We experiment on real and artificial data streams with varying types of concept drift to test the performance and adaptation capabilities of the proposed system. The results confirm the utility of OAML over popular online learning algorithms and underscore the benefits of continuous pipeline redesign in the presence of data drift.
    Cluster-based Input Weight Initialization for Echo State Networks. (arXiv:2103.04710v3 [cs.LG] CROSS LISTED)
    Echo State Networks (ESNs) are a special type of recurrent neural networks (RNNs), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. Despite the recent success of ESNs in various tasks of audio, image and radar recognition, we postulate that a purely random initialization is not the ideal way of initializing ESNs. The aim of this work is to propose an unsupervised initialization of the input connections using the $K$-Means algorithm on the training data. We show that for a large variety of datasets this initialization performs equivalently to or better than a randomly initialized ESN while needing significantly fewer reservoir neurons. Furthermore, we discuss that this approach provides the opportunity to estimate a suitable size of the reservoir based on prior knowledge about the data.
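    A compact sketch of the proposed initialization, assuming sklearn-style training data: the rows of the input weight matrix are taken from normalized K-Means centroids of the inputs instead of being drawn at random. The normalization and scaling choices here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_input_weights(X, n_reservoir, scale=1.0, seed=0):
    """Cluster-based initialization sketch: use normalized K-Means
    centroids of the training inputs as rows of the input weight matrix."""
    km = KMeans(n_clusters=n_reservoir, n_init=10, random_state=seed).fit(X)
    W_in = km.cluster_centers_                        # (n_reservoir, n_features)
    norms = np.linalg.norm(W_in, axis=1, keepdims=True)
    return scale * W_in / np.maximum(norms, 1e-12)

X = np.random.randn(1000, 8)                          # toy training inputs
W_in = kmeans_input_weights(X, n_reservoir=50)
states = np.tanh(X @ W_in.T)                          # input-driven activations
```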
    Turtle Score -- Similarity Based Developer Analyzer. (arXiv:2205.04876v1 [stat.ML])
    In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the company culture. This research aims to comprehend, analyze and automatically produce convincing outcomes for finding a candidate who fits perfectly in the company. Data is examined and collected for each employee who works in the IT domain, focusing on their performance measures. This is done across various categories, which brings versatility and a wide view of focus. Learner analysis is then performed on this data using machine learning algorithms to obtain learner similarity and developer similarity, in order to recruit people with identical working patterns. It has been shown that the efficiency and capability of a particular worker increase when working with a person of a similar personality. This can therefore serve as a useful tool for recruiters who aim to recruit people with high productivity. The designed model renders the best possible outcome with high accuracy and a strong recommendation score.
    A Wasserstein distance approach for concentration of empirical risk estimates. (arXiv:1902.10709v4 [math.ST] UPDATED)
    This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases well known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper.
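    Under the first estimation scheme, the empirical estimate is simply the risk measure applied to the empirical distribution. As one concrete instance, a CVaR estimator might look like the sketch below; this is illustrative only, since the paper treats a much broader class of risk measures.

```python
import numpy as np

def empirical_cvar(samples, alpha=0.95):
    """Empirical CVaR at level alpha: the average of the worst (1 - alpha)
    fraction of losses, i.e. the risk measure applied to the empirical
    distribution formed from i.i.d. samples."""
    losses = np.sort(np.asarray(samples))
    var = np.quantile(losses, alpha)       # empirical value at risk
    tail = losses[losses >= var]
    return tail.mean()

rng = np.random.default_rng(1)
# with more samples, the estimate concentrates around the true CVaR,
# which is the behavior the paper's Wasserstein-based bounds quantify
print(empirical_cvar(rng.normal(size=10_000)))
```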
    Don't Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. (arXiv:2205.04790v1 [stat.ML])
    Decision-making algorithms, in practice, are often trained on data that exhibits a variety of biases. Decision-makers often aim to take decisions based on some ground-truth target that is assumed or expected to be unbiased, i.e., equally distributed across socially salient groups. In many practical settings, the ground-truth cannot be directly observed, and instead, we have to rely on a biased proxy measure of the ground-truth, i.e., biased labels, in the data. In addition, data is often selectively labeled, i.e., even the biased labels are only observed for a small fraction of the data that received a positive decision. To overcome label and selection biases, recent work proposes to learn stochastic, exploratory decision policies via i) online training of new policies at each time-step and ii) enforcing fairness as a constraint on performance. However, the existing approach uses only labeled data, disregarding a large amount of unlabeled data, and thereby suffers from high instability and variance in the learned decision policies at different times. In this paper, we propose a novel method based on a variational autoencoder for practical fair decision-making. Our method learns an unbiased data representation leveraging both labeled and unlabeled data and uses the representations to learn a policy in an online process. Using synthetic data, we empirically validate that our method converges to the optimal (fair) policy according to the ground-truth with low variance. In real-world experiments, we further show that our training approach not only offers a more stable learning process but also yields policies with higher fairness as well as utility than previous approaches.
    Disentangling A Single MR Modality. (arXiv:2205.04982v1 [eess.IV])
    Disentangling anatomical and contrast information from medical images has gained attention recently, demonstrating benefits for various image analysis tasks. Current methods learn disentangled representations using either paired multi-modal images with the same underlying anatomy or auxiliary labels (e.g., manual delineations) to provide inductive bias for disentanglement. However, these requirements could significantly increase the time and cost in data collection and limit the applicability of these methods when such data are not available. Moreover, these methods generally do not guarantee disentanglement. In this paper, we present a novel framework that learns theoretically and practically superior disentanglement from single modality magnetic resonance images. Moreover, we propose a new information-based metric to quantitatively evaluate disentanglement. Comparisons over existing disentangling methods demonstrate that the proposed method achieves superior performance in both disentanglement and cross-domain image-to-image translation tasks.
    On Recurrent Neural Networks for learning-based control: recent results and ideas for future developments. (arXiv:2111.13557v2 [eess.SY] UPDATED)
    This paper aims to discuss and analyze the potentialities of Recurrent Neural Networks (RNN) in control design applications. The main families of RNN are considered, namely Neural Nonlinear AutoRegressive eXogenous (NNARX), Echo State Networks (ESN), Long Short Term Memory (LSTM), and Gated Recurrent Units (GRU). The goal is twofold. Firstly, to survey recent results concerning the training of RNN that enjoy Input-to-State Stability (ISS) and Incremental Input-to-State Stability ($\delta$ISS) guarantees. Secondly, to discuss the issues that still hinder the widespread use of RNN for control, namely their robustness, verifiability, and interpretability. The former properties are related to the so-called generalization capabilities of the networks, i.e. their consistency with the underlying real plants, even in presence of unseen or perturbed input trajectories. The latter is instead related to the possibility of providing a clear formal connection between the RNN model and the plant. In this context, we illustrate how ISS and $\delta$ISS represent a significant step towards the robustness and verifiability of the RNN models, while the requirement of interpretability paves the way to the use of physics-based networks. The design of model predictive controllers with RNNs as the plant's model is also briefly discussed. Lastly, some of the main topics of the paper are illustrated on a simulated chemical system.
    Spike-based computational models of bio-inspired memories in the hippocampal CA3 region on SpiNNaker. (arXiv:2205.04782v1 [cs.NE])
    The human brain is the most powerful and efficient machine in existence today, surpassing in many ways the capabilities of modern computers. Currently, lines of research in neuromorphic engineering are trying to develop hardware that mimics the functioning of the brain to acquire these superior capabilities. One of the areas still under development is the design of bio-inspired memories, where the hippocampus plays an important role. This region of the brain acts as a short-term memory with the ability to store associations of information from different sensory streams in the brain and recall them later. This is possible thanks to the recurrent collateral network architecture that constitutes CA3, the main sub-region of the hippocampus. In this work, we developed two spike-based computational models of fully functional hippocampal bio-inspired memories for the storage and recall of complex patterns implemented with spiking neural networks on the SpiNNaker hardware platform. These models present different levels of biological abstraction, with the first model having a constant oscillatory activity closer to the biological model, and the second one having an energy-efficient regulated activity, which, although it is still bio-inspired, opts for a more functional approach. Different experiments were performed for each of the models, in order to test their learning/recalling capabilities. A comprehensive comparison between the functionality and the biological plausibility of the presented models was carried out, showing their strengths and weaknesses. The two models, which are publicly available for researchers, could pave the way for future spike-based implementations and applications.
    Gradient flows on graphons: existence, convergence, continuity equations. (arXiv:2111.09459v2 [math.PR] UPDATED)
    Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some mean-field interaction involving a gradient-type potential. However, in many problems, such as in multi-layer neural networks, the so-called particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their size grows to infinity. We show that the Euclidean gradient flow of a suitable function of the edge-weights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Several natural functions on graphons, such as homomorphism functions and the scalar entropy, are covered by our set-up, and the examples have been worked out in detail.
    Classification and mapping of low-statured 'shrubland' cover types in post-agricultural landscapes of the US Northeast. (arXiv:2205.05047v1 [cs.CV])
    Context: Novel plant communities reshape landscapes and pose challenges for land cover classification and mapping that can constrain research and stewardship efforts. In the US Northeast, emergence of low-statured woody vegetation, or 'shrublands', instead of secondary forests in post-agricultural landscapes is well-documented by field studies, but poorly understood from a landscape perspective, which limits the ability to systematically study and manage these lands. Objectives: To address gaps in classification/mapping of low-statured cover types where they have been historically rare, we developed models to predict 'shrubland' distributions at 30m resolution across New York State (NYS), using machine learning and model ensembling techniques to integrate remote sensing of structural (airborne LIDAR) and optical (satellite imagery) properties of vegetation cover. We first classified a 1m canopy height model (CHM), derived from a "patchwork" of available LIDAR coverages, to define shrubland presence/absence. Next, these non-contiguous maps were used to train a model ensemble based on temporally-segmented imagery to predict 'shrubland' probability for the entire study landscape (NYS). Results: Approximately 2.5% of the CHM coverage area was classified as shrubland. Models using Landsat predictors trained on the classified CHM were effective at identifying shrubland (test set AUC=0.893, real-world AUC=0.904), in discriminating between shrub/young forest and other cover classes, and produced qualitatively sensible maps, even when extending beyond the original training data. Conclusions: After ground-truthing, we expect these shrubland maps and models will have many research and stewardship applications including wildlife conservation, invasive species mitigation and natural climate solutions.
    Long-term stability and generalization of observationally-constrained stochastic data-driven models for geophysical turbulence. (arXiv:2205.04601v1 [cs.LG])
    Recent years have seen a surge in interest in building deep learning-based fully data-driven models for weather prediction. Such deep learning models if trained on observations can mitigate certain biases in current state-of-the-art weather models, some of which stem from inaccurate representation of subgrid-scale processes. However, these data-driven models, being over-parameterized, require a lot of training data which may not be available from reanalysis (observational data) products. Moreover, an accurate, noise-free, initial condition to start forecasting with a data-driven weather model is not available in realistic scenarios. Finally, deterministic data-driven forecasting models suffer from issues with long-term stability and unphysical climate drift, which makes these data-driven models unsuitable for computing climate statistics. Given these challenges, previous studies have tried to pre-train deep learning-based weather forecasting models on a large amount of imperfect long-term climate model simulations and then re-train them on available observational data. In this paper, we propose a convolutional variational autoencoder-based stochastic data-driven model that is pre-trained on an imperfect climate model simulation from a 2-layer quasi-geostrophic flow and re-trained, using transfer learning, on a small number of noisy observations from a perfect simulation. This re-trained model then performs stochastic forecasting with a noisy initial condition sampled from the perfect simulation. We show that our ensemble-based stochastic data-driven model outperforms a baseline deterministic encoder-decoder-based convolutional model in terms of short-term skills while remaining stable for long-term climate simulations yielding accurate climatology.
    Hyperparameter optimization of hybrid quantum neural networks for car classification. (arXiv:2205.04878v1 [quant-ph])
    Image recognition is one of the primary applications of machine learning algorithms. Nevertheless, machine learning models used in modern image recognition systems consist of millions of parameters that usually require significant computational time to be adjusted. Moreover, adjustment of model hyperparameters leads to additional overhead. Because of this, new developments in machine learning models and hyperparameter optimization techniques are required. This paper presents a quantum-inspired hyperparameter optimization technique and a hybrid quantum-classical machine learning model for supervised learning. We benchmark our hyperparameter optimization method over standard black-box objective functions and observe performance improvements in the form of reduced expected run times and fitness in response to the growth in the size of the search space. We test our approaches in a car image classification task, and demonstrate a full-scale implementation of the hybrid quantum neural network model with the tensor train hyperparameter optimization. Our tests show a qualitative and quantitative advantage over the corresponding standard classical tabular grid search approach used with a deep neural network ResNet34. A classification accuracy of 0.97 was obtained by the hybrid model after 18 iterations, whereas the classical model achieved an accuracy of 0.92 after 75 iterations.
    PyRCN: A Toolbox for Exploration and Application of Reservoir Computing Networks. (arXiv:2103.04807v3 [cs.LG] UPDATED)
    Reservoir Computing Networks (RCNs) belong to a group of machine learning techniques that project the input space non-linearly into a high-dimensional feature space, where the underlying task can be solved linearly. Popular variants of RCNs are capable of solving complex tasks equivalently to widely used deep neural networks, but with a substantially simpler training paradigm based on linear regression. In this paper, we show how to uniformly describe RCNs with small and clearly defined building blocks, and we introduce the Python toolbox PyRCN (Python Reservoir Computing Networks) for optimizing, training and analyzing RCNs on arbitrarily large datasets. The tool is based on widely-used scientific packages and complies with the scikit-learn interface specification. It provides a platform for educational and exploratory analyses of RCNs, as well as a framework to apply RCNs on complex tasks including sequence processing. With a small number of building blocks, the framework allows the implementation of numerous different RCN architectures. We provide code examples on how to set up RCNs for time series prediction and for sequence classification tasks. PyRCN is around ten times faster than reference toolboxes on a benchmark task while requiring substantially less boilerplate code.
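    PyRCN's own API is deliberately not reproduced here; instead, the following generic numpy/scikit-learn sketch shows the basic RCN recipe the toolbox packages: random input and recurrent weights, tanh reservoir dynamics, and a ridge-regression readout for one-step-ahead time series prediction.

```python
import numpy as np
from sklearn.linear_model import Ridge

def esn_states(u, n_res=200, spectral_radius=0.9, seed=0):
    """Minimal echo-state reservoir (a generic sketch, not PyRCN's API):
    random input/recurrent weights, tanh dynamics, states collected per step."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_res, u.shape[1]))
    W = rng.normal(size=(n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # echo state scaling
    x, X = np.zeros(n_res), []
    for t in range(len(u)):
        x = np.tanh(W_in @ u[t] + W @ x)
        X.append(x.copy())
    return np.array(X)

u = np.sin(np.linspace(0, 50, 1000))[:, None]     # toy time series
X = esn_states(u[:-1])
readout = Ridge(alpha=1e-3).fit(X, u[1:, 0])      # linear one-step-ahead readout
```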
    Optimizing over an ensemble of neural networks. (arXiv:2112.07007v2 [cs.LG] UPDATED)
    We study optimization problems where the objective function is modeled through feedforward neural networks with rectified linear unit (ReLU) activation. Recent literature has explored the use of a single neural network to model either uncertain or complex elements within an objective function. However, it is well known that ensembles of neural networks produce more stable predictions and have better generalizability than models with single neural networks, which motivates the investigation of ensembles of neural networks rather than single neural networks in decision-making pipelines. We study how to incorporate a neural network ensemble as the objective function of an optimization model and explore computational approaches for the ensuing problem. We present a mixed-integer linear program based on existing popular big-M formulations for optimizing over a single neural network. We develop a two-phase approach for our model that combines preprocessing procedures to tighten bounds for critical neurons in the neural networks with a Lagrangian relaxation-based branch-and-bound approach. Experimental evaluations of our solution methods suggest that using ensembles of neural networks yields more stable and higher quality solutions, compared to single neural networks, and that our optimization algorithm outperforms (the adaption of) a state-of-the-art approach in terms of computational time and optimality gaps.
    DeepAuditor: Distributed Online Intrusion Detection System for IoT devices via Power Side-channel Auditing. (arXiv:2106.12753v3 [cs.CR] UPDATED)
    As the number of IoT devices has increased rapidly, IoT botnets have exploited the vulnerabilities of IoT devices. However, it is still challenging to detect the initial intrusion on IoT devices prior to massive attacks. Recent studies have utilized power side-channel information to identify this intrusion behavior on IoT devices but still lack accurate models in real-time for ubiquitous botnet detection. We propose the first online intrusion detection system, called DeepAuditor, for IoT devices via power auditing. To develop the real-time system, we propose a lightweight power auditing device called Power Auditor. We also design a distributed CNN classifier for online inference in a laboratory setting. In order to protect against data leakage and reduce networking redundancy, we then propose a privacy-preserving inference protocol via Packed Homomorphic Encryption and a sliding window protocol in our system. The classification accuracy and processing time were measured, and the proposed classifier outperformed a baseline classifier, especially against unseen patterns. We also demonstrated that the distributed CNN design is secure against any distributed component. Overall, the measurements demonstrated the feasibility of our real-time distributed system for intrusion detection on IoT devices.
    Adaptation Strategies for Automated Machine Learning on Evolving Data. (arXiv:2006.06480v3 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of data stream challenges such as concept drift on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust. To that end, we propose 6 concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches. We do this for a variety of AutoML approaches for building machine learning pipelines, including those that leverage Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques.
    GRU-TV: Time- and velocity-aware GRU for patient representation on multivariate clinical time-series data. (arXiv:2205.04892v1 [cs.LG])
    Electronic health records (EHRs) provide a rich repository to track a patient's health status. EHRs seek to fully document the patient's physiological status, and include data that is high dimensional, heterogeneous, and multimodal. The significant differences in the sampling frequency of clinical variables can result in high missing rates and uneven time intervals between adjacent records in the multivariate clinical time-series data extracted from EHRs. Current studies using clinical time-series data for patient characterization view the patient's physiological status as a discrete process described by sporadically collected values, while the dynamics of the patient's physiological status are time-continuous. In addition, the recurrent neural network (RNN) models widely used for patient representation learning lack a perception of time intervals and velocity, which limits their ability to represent the physiological status of the patient. In this paper, we propose an improved gated recurrent unit (GRU), namely the time- and velocity-aware GRU (GRU-TV), for patient representation learning from clinical multivariate time-series data in a time-continuous manner. In the proposed GRU-TV, neural ordinary differential equations (ODEs) and a velocity perception mechanism are used to perceive the time intervals between records in the time-series data and the changing rate of the patient's physiological status, respectively. Experimental results on two real-world clinical EHR datasets (PhysioNet2012, MIMIC-III) show that GRU-TV achieves state-of-the-art performance in computer-aided diagnosis (CAD) tasks and is more advantageous in processing sampled data.
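    The exact GRU-TV equations are not given in the abstract, so the cell below is a simplified illustrative stand-in: it decays the hidden state according to the elapsed time before a standard GRU update, conveying the time-awareness idea without the ODE/velocity machinery.

```python
import torch
import torch.nn as nn

class TimeAwareGRUCell(nn.Module):
    """Illustrative time-aware GRU cell (not the paper's GRU-TV): the
    hidden state is exponentially decayed by the elapsed time dt before
    the standard GRU update, so long gaps forget more than short ones."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.decay = nn.Linear(1, hidden_size)

    def forward(self, x, h, dt):                  # dt: (batch, 1) elapsed time
        gamma = torch.exp(-torch.relu(self.decay(dt)))  # decay factor in (0, 1]
        return self.cell(x, gamma * h)

cell = TimeAwareGRUCell(12, 32)
h = torch.zeros(4, 32)
x, dt = torch.randn(4, 12), torch.rand(4, 1)
h = cell(x, h, dt)                                # one irregularly-timed step
```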
    Learning Combinatorial Node Labeling Algorithms. (arXiv:2106.03594v3 [cs.LG] UPDATED)
    We present a novel neural architecture to solve graph optimization problems where the solution consists of arbitrary node labels, allowing us to solve hard problems like graph coloring. We train our model using reinforcement learning, specifically policy gradients, which gives us both a greedy and a probabilistic policy. Our architecture builds on a graph attention network and uses several inductive biases to improve solution quality. Our learned deterministic heuristics for graph coloring give better solutions than classical degree-based greedy heuristics and only take seconds to apply to graphs with tens of thousands of vertices. Moreover, our probabilistic policies outperform all greedy state-of-the-art coloring baselines and a machine learning baseline. Finally, we show that our approach also generalizes to other problems by evaluating it on minimum vertex cover and outperforming two greedy heuristics.
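    For reference, the classical degree-based greedy baseline that the learned heuristics are compared against can be written in a few lines; this is the standard textbook procedure, not the paper's architecture.

```python
def greedy_coloring(adj):
    """Degree-based greedy coloring: visit vertices in order of decreasing
    degree and assign the smallest color unused by any colored neighbor."""
    order = sorted(adj, key=lambda v: -len(adj[v]))
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}  # a triangle plus a pendant
print(greedy_coloring(adj))                          # the triangle needs 3 colors
```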
    Accelerated functional brain aging in major depressive disorder: evidence from a large scale fMRI analysis of Chinese participants. (arXiv:2205.04871v1 [q-bio.NC])
    Major depressive disorder (MDD) is one of the most common mental health conditions that has been intensively investigated for its association with brain atrophy and mortality. Recent studies reveal that the deviation between the predicted and the chronological age can be a marker of accelerated brain aging to characterize MDD. However, current conclusions are usually drawn based on structural MRI information collected from Caucasian participants. The universality of this biomarker needs to be further validated by subjects with different ethnic/racial backgrounds and by different types of data. Here we make use of the REST-meta-MDD, a large scale resting-state fMRI dataset collected from multiple cohort participants in China. We develop a stacking machine learning model based on 1101 healthy controls, which estimates a subject's chronological age from fMRI with promising accuracy. The trained model is then applied to 1276 MDD patients from 24 sites. We observe that MDD patients exhibit a $+4.43$ years ($\text{$p$} < 0.0001$, $\text{Cohen's $d$} = 0.35$, $\text{95\% CI}:1.86 - 3.91$) higher brain-predicted age difference (brain-PAD) compared to controls. In the MDD subgroup, we observe a statistically significant $+2.09$ years ($\text{$p$} < 0.05$, $\text{Cohen's $d$} = 0.134483$) brain-PAD in antidepressant users compared to medication-free patients. The statistical relationship observed is further checked by three different machine learning algorithms. The positive brain-PAD observed in participants in China confirms the presence of accelerated brain aging in MDD patients. The utilization of functional brain connectivity for age estimation verifies existing findings from a new dimension.
    Pediatric Automatic Sleep Staging: A comparative study of state-of-the-art deep learning methods. (arXiv:2108.10211v3 [eess.SP] UPDATED)
    Background: Despite the tremendous progress recently made towards automatic sleep staging in adults, it is currently unknown if the most advanced algorithms generalize to the pediatric population, which displays distinctive characteristics in overnight polysomnography (PSG). Methods: To answer the question, in this work, we conduct a large-scale comparative study on the state-of-the-art deep learning methods for pediatric automatic sleep staging. Six different deep neural networks with diverging features are adopted to evaluate a sample of more than 1,200 children across a wide spectrum of obstructive sleep apnea (OSA) severity. Results: Our experimental results show that the individual performance of automated pediatric sleep stagers when evaluated on new subjects is equivalent to the expert-level one reported on adults. Combining the six stagers into ensemble models further boosts the staging accuracy, reaching an overall accuracy of 88.8%, a Cohen's kappa of 0.852, and a macro F1-score of 85.8%. At the same time, the ensemble models lead to reduced predictive uncertainty. The results also show that the studied algorithms and their ensembles are robust to concept drift when the training and test data were recorded seven months apart and after clinical intervention. Conclusion: However, we show that the improvements in the staging performance are not necessarily clinically significant although the ensemble models lead to more favorable clinical measures than the six standalone models. Significance: Detailed analyses further demonstrate "almost perfect" agreement between the automatic stagers to one another and their similar patterns on the staging errors, suggesting little room for improvement.
    Differentially Private Learning with Adaptive Clipping. (arXiv:1905.03871v5 [cs.LG] UPDATED)
    Existing approaches for training neural networks with user-level differential privacy (e.g., DP Federated Averaging) in federated learning (FL) settings involve bounding the contribution of each user's model update by clipping it to some constant value. However, there is no good a priori setting of the clipping norm across tasks and learning settings: the update norm distribution depends on the model architecture and loss, the amount of data on each device, the client learning rate, and possibly various other parameters. We propose a method wherein instead of a fixed clipping norm, one clips to a value at a specified quantile of the update norm distribution, where the value at the quantile is itself estimated online, with differential privacy. The method tracks the quantile closely, uses a negligible amount of privacy budget, is compatible with other federated learning technologies such as compression and secure aggregation, and has a straightforward joint DP analysis with DP-FedAvg. Experiments demonstrate that adaptive clipping to the median update norm works well across a range of realistic federated learning tasks, sometimes outperforming even the best fixed clip chosen in hindsight, and without the need to tune any clipping hyperparameter.
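    The quantile-tracking update can be sketched as a geometric adjustment of the clipping norm toward the target quantile of observed update norms; the DP noise that the actual method adds to the clipped-fraction estimate is omitted here for clarity.

```python
import numpy as np

def adaptive_clip_update(clip, update_norms, target_quantile=0.5, eta=0.2):
    """One step of quantile-based clip adaptation: nudge the clipping norm
    geometrically so that roughly a `target_quantile` fraction of update
    norms falls below it (DP noise on the fraction omitted in this sketch)."""
    frac_below = np.mean(update_norms <= clip)
    return clip * np.exp(-eta * (frac_below - target_quantile))

clip = 1.0
rng = np.random.default_rng(0)
for rnd in range(200):
    norms = rng.lognormal(mean=1.0, sigma=0.5, size=64)  # per-user update norms
    clip = adaptive_clip_update(clip, norms)
# the clip converges toward the median update norm (about e ~ 2.72 here)
print(clip, np.median(rng.lognormal(1.0, 0.5, 100_000)))
```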
    Adjusted Expected Improvement for Cumulative Regret Minimization in Noisy Bayesian Optimization. (arXiv:2205.04901v1 [cs.LG])
    The expected improvement (EI) is one of the most popular acquisition functions for Bayesian optimization (BO) and has demonstrated good empirical performances in many applications for the minimization of simple regret. However, under the evaluation metric of cumulative regret, the performance of EI may not be competitive, and its existing theoretical regret upper bound still has room for improvement. To adapt the EI for better performance under cumulative regret, we introduce a novel quantity called the evaluation cost which is compared against the acquisition function, and with this, develop the expected improvement-cost (EIC) algorithm. In each iteration of EIC, a new point with the largest acquisition function value is sampled, only if that value exceeds its evaluation cost. If no point meets this criterion, the current best point is resampled. This evaluation cost quantifies the potential downside of sampling a point, which is important under the cumulative regret metric as the objective function value in every iteration affects the performance measure. We further establish in theory a near-optimal regret upper bound of EIC for the squared-exponential covariance kernel under mild regularity conditions, and perform experiments to illustrate the improvement of EIC over several popular BO algorithms.
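    The accept-or-resample rule can be sketched independently of the GP machinery. Below, a toy acquisition (distance to the nearest sampled point) and a constant evaluation cost stand in for the paper's posterior-based quantities; only the control flow mirrors EIC.

        import numpy as np

        rng = np.random.default_rng(1)
        f = lambda x: np.sin(3 * x) + 0.1 * rng.normal()   # noisy objective (toy)
        candidates = np.linspace(0.0, 2.0, 201)

        X, y = [0.5], [f(0.5)]
        for t in range(20):
            best_idx = int(np.argmin(y))
            # Toy stand-ins: acquisition = distance to the nearest sampled point,
            # evaluation cost = a constant. The paper derives both from the GP posterior.
            acq = np.array([min(abs(c - xi) for xi in X) for c in candidates])
            cost = 0.05
            if acq.max() > cost:
                x_next = candidates[int(np.argmax(acq))]   # explore a new point
            else:
                x_next = X[best_idx]                       # resample the incumbent
            X.append(x_next)
            y.append(f(x_next))
        print(f"best observed value: {min(y):.3f} at x = {X[int(np.argmin(y))]:.3f}")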
    White-box Testing of NLP models with Mask Neuron Coverage. (arXiv:2205.05050v1 [cs.CL])
    Recent literature has seen growing interest in using black-box strategies like CheckList for testing the behavior of NLP models. Research on white-box testing has developed a number of methods for evaluating how thoroughly the internal behavior of deep models is tested, but they are not applicable to NLP models. We propose a set of white-box testing methods that are customized for transformer-based NLP models. These include Mask Neuron Coverage (MNCOVER) that measures how thoroughly the attention layers in models are exercised during testing. We show that MNCOVER can refine test suites generated by CheckList by substantially reducing their size, by more than 60% on average, while retaining failing tests -- thereby concentrating the fault detection power of the test suite. Further, we show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.
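    Coverage-guided suite reduction of this kind can be pictured as a greedy filter: keep a test if it fails or covers something new. A schematic sketch, with hypothetical coverage sets in place of the actual mask-neuron coverage measure:

        def reduce_suite(tests, coverage, failing):
            """Greedily keep a test only if it is a failing test or adds
            coverage not yet seen (coverage: test -> set of covered units)."""
            kept, covered = [], set()
            # Visit high-coverage tests first so fewer tests are needed overall.
            for t in sorted(tests, key=lambda t: -len(coverage[t])):
                if t in failing or not coverage[t] <= covered:
                    kept.append(t)
                    covered |= coverage[t]
            return kept

        # Hypothetical mask-neuron coverage sets for four tests.
        cov = {"t1": {1, 2, 3}, "t2": {2, 3}, "t3": {4}, "t4": {1}}
        print(reduce_suite(cov.keys(), cov, failing={"t2"}))  # ['t1', 't2', 't3']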
    Deep learning based Chinese text sentiment mining and stock market correlation research. (arXiv:2205.04743v1 [q-fin.CP])
    We explore how to crawl financial forum data such as stock bars and combine it with deep learning models for sentiment analysis. In this paper, we use the BERT model trained on a financial corpus to predict the SZSE Component Index, and find, through a maximum information coefficient comparison study, that the sentiment features obtained by applying BERT to the financial corpus reflect the fluctuations in the stock market and help to improve the prediction accuracy effectively. Meanwhile, this paper combines deep learning with financial text to further explore the mechanism by which investor sentiment acts on the stock market, which will be beneficial for national regulators and policy departments in developing more reasonable policy guidelines for maintaining the stability of the stock market.
    Control Prefixes for Parameter-Efficient Text Generation. (arXiv:2110.08329v2 [cs.CL] UPDATED)
    Prefix-tuning is a powerful lightweight technique for adapting a large pre-trained language model to a downstream application. However, it uses the same dataset-level tuned prompt for all examples in the dataset. We extend this idea and propose a dynamic method, Control Prefixes, which allows for the inclusion of conditional input-dependent information, combining the benefits of prompt tuning and controlled generation. The method incorporates attribute-level learnable representations into different layers of a pre-trained transformer, allowing for the generated text to be guided in a particular direction. We provide a systematic evaluation of the technique and apply it to five datasets from the GEM benchmark for natural language generation (NLG). Although the aim is to develop a parameter-efficient model, we show Control Prefixes can even outperform full fine-tuning methods. We present state-of-the-art results on several data-to-text datasets, including WebNLG.  ( 2 min )
    Measure and Improve Robustness in NLP Models: A Survey. (arXiv:2112.08313v2 [cs.CL] UPDATED)
    As NLP models have achieved state-of-the-art performance on benchmarks and gained wide application, it has become increasingly important to ensure the safe deployment of these models in the real world, e.g., making sure the models are robust against unseen or challenging scenarios. Despite robustness being an increasingly studied topic, it has been explored separately in applications like vision and NLP, with various definitions, evaluation and mitigation strategies in multiple lines of research. In this paper, we aim to provide a unifying survey of how to define, measure and improve robustness in NLP. We first connect multiple definitions of robustness, then unify various lines of work on identifying robustness failures and evaluating models' robustness. Correspondingly, we present mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in NLP models. Finally, we conclude by outlining open challenges and future directions to motivate further research in this area.  ( 2 min )
    Large Neighborhood Search based on Neural Construction Heuristics. (arXiv:2205.00772v2 [cs.LG] UPDATED)
    We propose a Large Neighborhood Search (LNS) approach utilizing a learned construction heuristic based on neural networks as a repair operator to solve the vehicle routing problem with time windows (VRPTW). Our method uses graph neural networks to encode the problem and auto-regressively decode a solution, and is trained with reinforcement learning on the construction task without requiring any labels for supervision. The neural repair operator is combined with a local search routine, heuristic destruction operators and a selection procedure applied to a small population to arrive at a sophisticated solution approach. The key idea is to use the learned model to re-construct the partially destructed solution and to introduce randomness via the destruction heuristics (or the stochastic policy itself) to effectively explore a large neighborhood.  ( 2 min )
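    The destroy-and-repair loop at the heart of LNS fits in a short sketch. Here a greedy cheapest-insertion routine stands in for the learned neural construction heuristic, and a simple path cost replaces the VRPTW objective; all names are illustrative.

        import random

        def cost(route, dist):
            return sum(dist[a][b] for a, b in zip(route, route[1:]))

        def lns(customers, dist, repair, iters=200, destroy_frac=0.3, seed=0):
            """Large Neighborhood Search: repeatedly remove part of the solution
            and rebuild it with a repair operator, keeping improvements."""
            rnd = random.Random(seed)
            best = repair([], list(customers))           # initial construction
            best_cost = cost(best, dist)
            for _ in range(iters):
                partial = list(best)
                removed = rnd.sample(partial, int(destroy_frac * len(partial)))
                for c in removed:
                    partial.remove(c)                    # destroy step
                cand = repair(partial, removed)          # repair step
                if cost(cand, dist) < best_cost:
                    best, best_cost = cand, cost(cand, dist)
            return best, best_cost

        def greedy_repair(partial, removed):
            # Insert each removed customer at its cheapest position (toy stand-in
            # for the learned construction policy).
            for c in removed:
                best_pos = min(range(len(partial) + 1),
                               key=lambda i: cost(partial[:i] + [c] + partial[i:], DIST))
                partial.insert(best_pos, c)
            return partial

        n = 8
        rnd = random.Random(42)
        DIST = [[abs(i - j) + rnd.random() for j in range(n)] for i in range(n)]
        route, c = lns(range(n), DIST, greedy_repair)
        print(route, round(c, 2))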
    Explainable Deep Learning Methods in Medical Diagnosis: A Survey. (arXiv:2205.04766v1 [eess.IV])
    The remarkable success of deep learning has prompted interest in its application to medical diagnosis. Even though state-of-the-art deep learning models have achieved human-level accuracy on the classification of different types of medical data, these models are hardly adopted in clinical workflows, mainly due to their lack of interpretability. The black-box-ness of deep learning models has raised the need for devising strategies to explain the decision process of these models, leading to the creation of the topic of eXplainable Artificial Intelligence (XAI). In this context, we provide a thorough survey of XAI applied to medical diagnosis, including visual, textual, and example-based explanation methods. Moreover, this work reviews the existing medical imaging datasets and the existing metrics for evaluating the quality of the explanations. Complementary to most existing surveys, we include a performance comparison among a set of report generation-based methods. Finally, the major challenges in applying XAI to medical imaging are also discussed.  ( 2 min )
    Are Quantum Computers Practical Yet? A Case for Feature Selection in Recommender Systems using Tensor Networks. (arXiv:2205.04490v1 [cs.IR])
    Collaborative filtering models generally perform better than content-based filtering models and do not require careful feature engineering. However, in the cold-start scenario collaborative information may be scarce or even unavailable, whereas the content information may be abundant, but also noisy and expensive to acquire. Thus, selection of particular features that improve cold-start recommendations becomes an important and non-trivial task. In the recent approach by Nembrini et al., the feature selection is driven by the correlational compatibility between collaborative and content-based models. The problem is formulated as a Quadratic Unconstrained Binary Optimization (QUBO) which, due to its NP-hard complexity, is solved using Quantum Annealing on a quantum computer provided by D-Wave. Inspired by the reported results, we contest the idea that current quantum annealers are superior for this problem and instead focus on classical algorithms. In particular, we tackle QUBO via TTOpt, a recently proposed black-box optimizer based on tensor networks and multilinear algebra. We show the computational feasibility of this method for large problems with thousands of features, and empirically demonstrate that the solutions found are comparable to the ones obtained with D-Wave across all examined datasets.  ( 2 min )
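    For orientation, a QUBO instance is just a binary vector minimizing $x^T Q x$. The brute-force solver below is a sketch that only works for tiny, hypothetical instances; it is neither TTOpt nor quantum annealing, but it shows the objective both methods attack.

        import itertools
        import numpy as np

        def qubo_energy(x, Q):
            return x @ Q @ x          # objective: x^T Q x with x in {0,1}^n

        def brute_force_qubo(Q):
            n = Q.shape[0]
            best_x, best_e = None, np.inf
            for bits in itertools.product([0, 1], repeat=n):
                x = np.array(bits)
                e = qubo_energy(x, Q)
                if e < best_e:
                    best_x, best_e = x, e
            return best_x, best_e

        # Hypothetical 4-feature selection problem: diagonal = per-feature utility,
        # off-diagonal = pairwise redundancy/compatibility terms.
        Q = np.array([[-1.0,  0.5,  0.0,  0.0],
                      [ 0.5, -1.0,  0.5,  0.0],
                      [ 0.0,  0.5, -2.0,  0.5],
                      [ 0.0,  0.0,  0.5, -1.5]])
        x, e = brute_force_qubo(Q)
        print("selected features:", np.flatnonzero(x), "energy:", e)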
    Unsupervised Belief Representation Learning with Information-Theoretic Variational Graph Auto-Encoders. (arXiv:2110.00210v5 [cs.SI] UPDATED)
    This paper develops a novel unsupervised algorithm for belief representation learning in polarized networks that (i) uncovers the latent dimensions of the underlying belief space and (ii) jointly embeds users and content items (that they interact with) into that space in a manner that facilitates a number of downstream tasks, such as stance detection, stance prediction, and ideology mapping. Inspired by total correlation in information theory, we propose the Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) that learns to project both users and content items (e.g., posts that represent user views) into an appropriate disentangled latent space. To better disentangle latent variables in that space, we develop a total correlation regularization module, a Proportional-Integral (PI) control module, and adopt rectified Gaussian distribution to ensure the orthogonality. The latent representation of users and content can then be used to quantify their ideological leaning and detect/predict their stances on issues. We evaluate the performance of the proposed InfoVGAE on three real-world datasets, of which two are collected from Twitter and one from U.S. Congress voting records. The evaluation results show that our model outperforms state-of-the-art unsupervised models by reducing user clustering errors by 10.5% and achieving 12.1% higher F1 scores for stance separation of content items. In addition, InfoVGAE produces results comparable to those of supervised models. We also discuss its performance on stance prediction and user ranking within ideological groups.  ( 3 min )
    Theory of Quantum Generative Learning Models with Maximum Mean Discrepancy. (arXiv:2205.04730v1 [quant-ph])
    The intrinsic probabilistic nature of quantum mechanics invokes endeavors of designing quantum generative learning models (QGLMs) with computational advantages over classical ones. To date, two prototypical QGLMs are quantum circuit Born machines (QCBMs) and quantum generative adversarial networks (QGANs), which approximate the target distribution in explicit and implicit ways, respectively. Despite the empirical achievements, the fundamental theory of these models remains largely obscure. To narrow this knowledge gap, here we explore the learnability of QCBMs and QGANs from the perspective of generalization when their loss is specified to be the maximum mean discrepancy. Particularly, we first analyze the generalization ability of QCBMs and identify their superiorities when the quantum devices can directly access the target distribution and the quantum kernels are employed. Next, we prove how the generalization error bound of QGANs depends on the employed Ansatz, the number of qudits, and input states. This bound can be further employed to seek potential quantum advantages in Hamiltonian learning tasks. Numerical results of QGLMs in approximating quantum states, Gaussian distribution, and ground states of parameterized Hamiltonians accord with the theoretical analysis. Our work opens the avenue for quantitatively understanding the power of quantum generative learning models.  ( 2 min )
    Matrix and graph representations of vine copula structures. (arXiv:2205.04783v1 [stat.ML])
    Vine copulas can efficiently model a large portion of probability distributions. This paper focuses on a more thorough understanding of their structures. We are building on well-known existing constructions to represent vine copulas with graphs as well as matrices. The graph representations include the regular, cherry and chordal graph sequence structures, which we show to be equivalent. Importantly, we also show that when a perfect elimination ordering of a vine structure is given, then it can always be uniquely represented with a matrix. O. M. Nápoles has shown a way to represent them in a matrix, and we formalize this previous approach as an algorithm, while also showing a new method for constructing such a matrix, through cherry tree sequences. Lastly, we prove that these two matrix-building algorithms are equivalent if the same perfect elimination ordering is being used.  ( 2 min )
    Modeling Regime Shifts in Multiple Time Series. (arXiv:2109.09692v3 [cs.LG] UPDATED)
    We investigate the problem of discovering and modeling regime shifts in an ecosystem comprising multiple time series known as co-evolving time series. Regime shifts refer to the changing behaviors exhibited by series at different time intervals. Learning these changing behaviors is a key step toward time series forecasting. While advances have been made, existing methods suffer from one or more of the following shortcomings: (1) failure to take relationships between time series into consideration for discovering regimes in multiple time series; (2) lack of an effective approach that models time-dependent behaviors exhibited by series; (3) difficulties in handling data discontinuities which may be informative. Most of the existing methods are unable to handle all of these three issues in a unified framework. This, therefore, motivates our effort to devise a principled approach for modeling interactions and time-dependency in co-evolving time series. Specifically, we model an ecosystem of multiple time series by summarizing the heavy ensemble of time series into a lighter and more meaningful structure called a "mapping grid". By using the mapping grid, our model first learns time series behavioral dependencies through a dynamic network representation, then learns the regime transition mechanism via a full time-dependent Cox regression model. The originality of our approach lies in modeling interactions between time series in regime identification and in modeling time-dependent regime transition probabilities, usually assumed to be static in existing work.  ( 2 min )
    DeepTag: A General Framework for Fiducial Marker Design and Detection. (arXiv:2105.13731v2 [cs.CV] UPDATED)
    A fiducial marker system usually consists of markers, a detection algorithm, and a coding system. The appearance of markers and the detection robustness are generally limited by the existing detection algorithms, which are hand-crafted with traditional low-level image processing techniques. Furthermore, a sophisticatedly designed coding system is required to overcome the shortcomings of both markers and detection algorithms. To improve the flexibility and robustness in various applications, we propose a general deep learning based framework, DeepTag, for fiducial marker design and detection. DeepTag not only supports detection of a wide variety of existing marker families, but also makes it possible to design new marker families with customized local patterns. Moreover, we propose an effective procedure to synthesize training data on the fly without manual annotations. Thus, DeepTag can easily adapt to existing and newly-designed marker families. To validate DeepTag and existing methods, besides existing datasets, we further collect a new large and challenging dataset where markers are placed in different view distances and angles. Experiments show that DeepTag well supports different marker families and greatly outperforms the existing methods in terms of both detection robustness and pose accuracy. Both code and dataset are available at https://herohuyongtao.github.io/research/publications/deep-tag/.  ( 2 min )
    THOR: Threshold-Based Ranking Loss for Ordinal Regression. (arXiv:2205.04864v1 [cs.LG])
    In this work, we present a regression-based ordinal regression algorithm for supervised classification of instances into ordinal categories. In contrast to previous methods, the decision boundaries between categories are predefined, and the algorithm learns to project the input examples onto their appropriate scores according to these predefined boundaries. This is achieved by adding a novel threshold-based pairwise loss function that aims at minimizing the regression error, which in turn minimizes the Mean Absolute Error (MAE) measure. We implemented our proposed architecture-agnostic method using a CNN framework for feature extraction. Experimental results on five real-world benchmarks demonstrate that the proposed algorithm achieves the best MAE results compared to state-of-the-art ordinal regression algorithms.  ( 2 min )
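    One plausible reading of the predefined-boundary idea, in code: penalize a score for falling outside its class interval. This hinge-style formulation is an illustrative interpretation, not the paper's exact pairwise loss.

        import numpy as np

        def threshold_loss(scores, labels, thresholds, margin=0.1):
            """Penalize a predicted score for lying outside the predefined
            interval [thresholds[k-1], thresholds[k]] of its ordinal class k.
            Sketch only: a hinge-style stand-in for the paper's pairwise loss."""
            t = np.concatenate(([-np.inf], thresholds, [np.inf]))
            lo, hi = t[labels], t[labels + 1]
            below = np.maximum(0.0, lo + margin - scores)   # score under the lower boundary
            above = np.maximum(0.0, scores - hi + margin)   # score over the upper boundary
            return np.mean(below + above)

        thresholds = np.array([1.0, 2.0, 3.0, 4.0])         # boundaries for 5 ordinal classes
        scores = np.array([0.4, 1.7, 3.6, 2.2])
        labels = np.array([0, 1, 3, 3])
        print(threshold_loss(scores, labels, thresholds))   # only the last score is penalized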
    Designing a Recurrent Neural Network to Learn a Motion Planner for High-Dimensional Inputs. (arXiv:2205.04799v1 [cs.RO])
    The use of machine learning in the self-driving industry has boosted a number of recent advancements. In particular, the usage of large deep learning models in the perception and prediction stack has proved quite successful, but there is still little literature on the use of machine learning in the planning stack. The current state of the art in the planning stack often relies on fast constrained optimization or rule-based approaches. Both of these techniques fail to address a significant number of fundamental problems that would allow the vehicle to operate more like a human driver. In this paper, we attempt to design a basic deep learning system to approach this problem. Furthermore, the main underlying goal of this paper is to demonstrate the potential uses of machine learning in the planning stack for autonomous vehicles (AV) and provide a baseline work for ongoing and future research.  ( 2 min )
    Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey. (arXiv:2205.04712v1 [cs.LG])
    The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches, and eventually to increase the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.  ( 2 min )
    Secure and Private Source Coding with Private Key and Decoder Side Information. (arXiv:2205.05068v1 [cs.IT])
    The problem of secure source coding with multiple terminals is extended by considering a remote source whose noisy measurements are the correlated random variables used for secure source reconstruction. The main additions to the problem include 1) all terminals noncausally observe a noisy measurement of the remote source; 2) a private key is available to all legitimate terminals; 3) the public communication link between the encoder and decoder is rate-limited; 4) the secrecy leakage to the eavesdropper is measured with respect to the encoder input, whereas the privacy leakage is measured with respect to the remote source. Exact rate regions are characterized for a lossy source coding problem with a private key, remote source, and decoder side information under security, privacy, communication, and distortion constraints. By replacing the distortion constraint with a reliability constraint, we obtain the exact rate region also for the lossless case. Furthermore, the lossy rate region for scalar discrete-time Gaussian sources and measurement channels is established.  ( 2 min )
    Explainable Data Imputation using Constraints. (arXiv:2205.04731v1 [cs.AI])
    Data values in a dataset can be missing or anomalous due to mishandling or human error. Analysing data with missing values can create bias and affect the inferences. Several analysis methods, such as principal component analysis or singular value decomposition, require complete data. Many approaches impute numeric data, and some do not consider the dependency of attributes on other attributes, while some require human intervention and domain knowledge. We present a new algorithm for data imputation based on different data type values and their association constraints in data, which are not handled currently by any system. We show experimental results using different metrics comparing our algorithm with state-of-the-art imputation techniques. Our algorithm not only imputes the missing values but also generates human-readable explanations describing the significance of attributes used for every imputation.  ( 2 min )
    Sensible AI: Re-imagining Interpretability and Explainability using Sensemaking Theory. (arXiv:2205.05057v1 [cs.HC])
    Understanding how ML models work is a prerequisite for responsibly designing, deploying, and using ML-based systems. With interpretability approaches, ML can now offer explanations for its outputs to aid human understanding. Though these approaches rely on guidelines for how humans explain things to each other, they ultimately solve for improving the artifact -- an explanation. In this paper, we propose an alternate framework for interpretability grounded in Weick's sensemaking theory, which focuses on who the explanation is intended for. Recent work has advocated for the importance of understanding stakeholders' needs -- we build on this by providing concrete properties (e.g., identity, social context, environmental cues, etc.) that shape human understanding. We use an application of sensemaking in organizations as a template for discussing design guidelines for Sensible AI, AI that factors in the nuances of human cognition when trying to explain itself.  ( 2 min )
    ALLSH: Active Learning Guided by Local Sensitivity and Hardness. (arXiv:2205.04980v1 [cs.CL])
    Active learning, which effectively collects informative unlabeled data for annotation, reduces the demand for labeled data. In this work, we propose to retrieve unlabeled samples with a local sensitivity and hardness-aware acquisition function. The proposed method generates data copies through local perturbations and selects data points whose predictive likelihoods diverge the most from their copies. We further empower our acquisition function by injecting the selected worst-case perturbation. Our method achieves consistent gains over the commonly used active learning strategies in various classification tasks. Furthermore, we observe consistent improvements over the baselines on the study of prompt selection in prompt-based few-shot learning. These experiments demonstrate that our acquisition guided by local sensitivity and hardness can be effective and beneficial for many NLP tasks.  ( 2 min )
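    The divergence-from-copies criterion is easy to sketch: perturb each pool point a few times and score it by the divergence between the model's predictions on the original and its copies. The toy softmax model and Gaussian perturbations below are stand-ins for the paper's setup.

        import numpy as np

        def kl(p, q, eps=1e-12):
            return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)

        def acquire(predict_proba, X_pool, k, n_copies=4, noise=0.05, seed=0):
            """Score each unlabeled point by how much the model's prediction
            diverges under small local perturbations; pick the top-k."""
            rng = np.random.default_rng(seed)
            p = predict_proba(X_pool)
            scores = np.zeros(len(X_pool))
            for _ in range(n_copies):
                X_tilde = X_pool + rng.normal(0.0, noise, X_pool.shape)  # local copy
                scores += kl(p, predict_proba(X_tilde))
            return np.argsort(-scores)[:k]

        # Toy model: softmax over two linear logits (hypothetical).
        def predict_proba(X):
            logits = np.stack([X @ np.array([1.0, -1.0]),
                               X @ np.array([-1.0, 1.0])], axis=1)
            z = np.exp(logits - logits.max(axis=1, keepdims=True))
            return z / z.sum(axis=1, keepdims=True)

        X_pool = np.random.default_rng(1).normal(size=(100, 2))
        print("indices to annotate:", acquire(predict_proba, X_pool, k=5))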
    Tensor-based Collaborative Filtering With Smooth Ratings Scale. (arXiv:2205.05070v1 [cs.IR])
    Conventional collaborative filtering techniques do not take into account the effect of discrepancies in users' rating perception. Some users may rarely give 5 stars to items while others almost always assign 5 stars to the chosen item. Even if they had experience with the same items, this systematic discrepancy in their evaluation styles will lead to systematic errors in the ability of the recommender system to effectively extract the right patterns from data. To mitigate this problem we introduce the ratings' similarity matrix which represents the dependency between different values of ratings on the population level. Hence, if on average the correlations between ratings exist, it is possible to improve the quality of proposed recommendations by offsetting the effect of either shifted-down or shifted-up users' ratings.  ( 2 min )
    Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation. (arXiv:2102.07301v2 [cs.LG] UPDATED)
    We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.  ( 2 min )
    Reconstruction Enhanced Multi-View Contrastive Learning for Anomaly Detection on Attributed Networks. (arXiv:2205.04816v1 [cs.LG])
    Detecting abnormal nodes from attributed networks is of great importance in many real applications, such as financial fraud detection and cyber security. This task is challenging due to both the complex interactions between the anomalous nodes and other counterparts and their inconsistency in terms of attributes. This paper proposes a self-supervised learning framework that jointly optimizes a multi-view contrastive learning-based module and an attribute reconstruction-based module to more accurately detect anomalies on attributed networks. Specifically, two contrastive learning views are first established, which allow the model to better encode rich local and global information related to the abnormality. Motivated by the attribute consistency principle between neighboring nodes, a masked autoencoder-based reconstruction module is also introduced to identify the nodes with large reconstruction errors, which are then regarded as anomalies. Finally, the two complementary modules are integrated for more accurately detecting the anomalous nodes. Extensive experiments conducted on five benchmark datasets show that our model outperforms current state-of-the-art models.  ( 2 min )
    Universal Caching. (arXiv:2205.04860v1 [cs.IT])
    In the learning literature, the performance of an online policy is commonly measured in terms of the static regret metric, which compares the cumulative loss of an online policy to that of an optimal benchmark in hindsight. In the definition of static regret, the benchmark policy remains fixed throughout the time horizon. Naturally, the resulting regret bounds become loose in non-stationary settings where fixed benchmarks often suffer from poor performance. In this paper, we investigate a stronger notion of regret minimization in the context of an online caching problem. In particular, we allow the action of the offline benchmark at any round to be decided by a finite state predictor containing arbitrarily many states. Using ideas from the universal prediction literature in information theory, we propose an efficient online caching policy with an adaptive sub-linear regret bound. To the best of our knowledge, this is the first data-dependent regret bound known for the universal caching problem. We establish this result by combining a recently-proposed online caching policy with an incremental parsing algorithm, e.g., Lempel-Ziv '78. Our methods also yield a simpler learning-theoretic proof of the improved regret bound as opposed to the more involved and problem-specific combinatorial arguments used in the earlier works.  ( 2 min )
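    The incremental-parsing ingredient is classical and compact. A minimal LZ78-style parser, whose growing phrase dictionary is the kind of finite-state predictor such a policy can compete against:

        def lz78_parse(seq):
            """Incrementally parse a sequence into distinct phrases (LZ78).
            Each new phrase = an existing phrase plus one symbol; the phrase
            dictionary can then serve as states of a sequence predictor."""
            phrases = {(): 0}          # empty prefix
            current = ()
            out = []
            for s in seq:
                if current + (s,) in phrases:
                    current += (s,)    # keep extending a known phrase
                else:
                    phrases[current + (s,)] = len(phrases)
                    out.append((phrases[current], s))  # (prefix id, new symbol)
                    current = ()
            return out, phrases

        requests = "ABABABABBA"        # a toy request sequence
        codes, dictionary = lz78_parse(requests)
        print(codes)                   # [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A'), (2, 'B')]
        print(len(dictionary) - 1, "phrases learned")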
    Weakly-supervised segmentation of referring expressions. (arXiv:2205.04725v1 [cs.CV])
    Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been addressed in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefore introduce a new task of weakly-supervised image segmentation from referring expressions and propose Text grounded semantic SEGmentation (TSEG) that learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our transformer-based method computes patch-text similarities and guides the classification objective during training with a new multi-label patch assignment mechanism. The resulting visual grounding model segments image regions corresponding to given natural language expressions. Our approach TSEG demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.  ( 2 min )
    Flow Completion Network: Inferring the Fluid Dynamics from Incomplete Flow Information using Graph Neural Networks. (arXiv:2205.04739v1 [physics.flu-dyn])
    This paper introduces a novel neural network -- the flow completion network (FCN) -- to infer the fluid dynamics, including the flow field and the force acting on the body, from the incomplete data based on Graph Convolution Attention Network. The FCN is composed of several graph convolution layers and spatial attention layers. It is designed to infer the velocity field and the vortex force contribution of the flow field when combined with the vortex force map (VFM) method. Compared with other neural networks adopted in fluid dynamics, the FCN is capable of dealing with both structured data and unstructured data. The performance of the proposed FCN is assessed by the computational fluid dynamics (CFD) data on the flow field around a circular cylinder. The force coefficients predicted by our model are validated against those obtained directly from CFD. Moreover, it is shown that our model effectively utilizes the existing flow field information and the gradient information simultaneously, giving a better performance than the traditional CNN-based and DNN-based models.  ( 2 min )
    AI training resources for GLAM: a snapshot. (arXiv:2205.04738v1 [cs.LG])
    We take a snapshot of current resources available for teaching and learning AI with a focus on the Galleries, Libraries, Archives and Museums (GLAM) community. The review was carried out in 2021 and 2022. The review provides an overview of material we identified as being relevant, offers a description of this material and makes recommendations for future work in this area.  ( 2 min )
    A spatial-temporal short-term traffic flow prediction model based on dynamical-learning graph convolution mechanism. (arXiv:2205.04762v1 [cs.LG])
    Short-term traffic flow prediction is a vital branch of the Intelligent Traffic System (ITS) and plays an important role in traffic management. Graph convolution network (GCN) is widely used in traffic prediction models to better deal with the graphical structure data of road networks. However, the influence weights among different road sections are usually distinct in real life, and hard to analyze manually. The traditional GCN mechanism, relying on a manually-set adjacency matrix, is unable to dynamically learn such spatial patterns during training. To deal with this drawback, this paper proposes a novel location graph convolutional network (Location-GCN). Location-GCN solves this problem by adding a new learnable matrix into the GCN mechanism, using the absolute value of this matrix to represent the distinct influence levels among different nodes. Then, long short-term memory (LSTM) is employed in the proposed traffic prediction model. Moreover, trigonometric function encoding is used in this study to enable the short-term input sequence to convey the long-term periodical information. Ultimately, the proposed model is compared with the baseline models and evaluated on two real-world traffic flow datasets. The results show our model is more accurate and robust on both datasets than other representative traffic prediction models.  ( 2 min )
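    A sketch of the stated mechanism: a graph-convolution layer whose influence matrix is the absolute value of a learnable matrix, plus sinusoidal time-of-day encodings. Shapes, the normalization, and the daily period of 288 five-minute slots are illustrative assumptions.

        import numpy as np

        def location_gcn_layer(H, M, W):
            """One layer: A = |M| acts as a learned influence matrix between
            road sections (in place of a fixed adjacency), followed by a
            standard graph-convolution update H' = ReLU(norm(A) H W)."""
            A = np.abs(M)                                   # learned influence weights
            A_hat = A / A.sum(axis=1, keepdims=True)        # simple row normalization
            return np.maximum(A_hat @ H @ W, 0.0)

        def trig_encode(t, dims=4, period=288):
            """Encode a time-of-day index with sin/cos pairs so short input
            windows still carry daily periodicity."""
            freqs = 2 * np.pi * np.arange(1, dims // 2 + 1) / period
            return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

        rng = np.random.default_rng(0)
        n_nodes, in_dim, out_dim = 6, 8, 16
        H = rng.normal(size=(n_nodes, in_dim))              # node features (flows)
        M = rng.normal(size=(n_nodes, n_nodes))             # learnable during training
        W = rng.normal(size=(in_dim, out_dim))
        print(location_gcn_layer(H, M, W).shape)            # (6, 16)
        print(trig_encode(t=37))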
    A Verification Framework for Certifying Learning-Based Safety-Critical Aviation Systems. (arXiv:2205.04590v1 [eess.SY])
    We present a safety verification framework for design-time and run-time assurance of learning-based components in aviation systems. Our proposed framework integrates two novel methodologies. From the design-time assurance perspective, we propose offline mixed-fidelity verification tools that incorporate knowledge from different levels of granularity in simulated environments. From the run-time assurance perspective, we propose reachability- and statistics-based online monitoring and safety guards for a learning-based decision-making model to complement the offline verification methods. This framework is designed to be loosely coupled among modules, allowing the individual modules to be developed using independent methodologies and techniques, under varying circumstances and with different tool access. The proposed framework offers feasible solutions for meeting system safety requirements at different stages throughout the system development and deployment cycle, enabling the continuous learning and assessment of the system product.  ( 2 min )
    On Causality in Domain Adaptation and Semi-Supervised Learning: an Information-Theoretic Analysis. (arXiv:2205.04641v1 [cs.LG])
    The establishment of the link between causality and unsupervised domain adaptation (UDA)/semi-supervised learning (SSL) has led to methodological advances in these learning problems in recent years. However, a formal theory that explains the role of causality in the generalization performance of UDA/SSL is still lacking. In this paper, we consider the UDA/SSL setting where we access $m$ labeled source data and $n$ unlabeled target data as training instances under a parametric probabilistic model. We study the learning performance (e.g., excess risk) of prediction in the target domain. Specifically, we distinguish two scenarios: the learning problem is called causal learning if the feature is the cause and the label is the effect, and is called anti-causal learning otherwise. We show that in causal learning, the excess risk depends on the size of the source sample at a rate of $O(1/m)$ only if the labelling distribution between the source and target domains remains unchanged. In anti-causal learning, we show that the unlabeled data dominate the performance at a rate of typically $O(1/n)$. Our analysis is based on the notion of potential outcome random variables and information theory. These results bring out the relationship between the data sample size and the hardness of the learning problem with different causal mechanisms.  ( 2 min )
    Surreal-GAN:Semi-Supervised Representation Learning via GAN for uncovering heterogeneous disease-related imaging patterns. (arXiv:2205.04523v1 [cs.LG])
    A plethora of machine learning methods have been applied to imaging data, enabling the construction of clinically relevant imaging signatures of neurological and neuropsychiatric diseases. Oftentimes, such methods don't explicitly model the heterogeneity of disease effects, or approach it via nonlinear models that are not interpretable. Moreover, unsupervised methods may parse heterogeneity that is driven by nuisance confounding factors that affect brain structure or function, rather than heterogeneity relevant to a pathology of interest. On the other hand, semi-supervised clustering methods seek to derive a dichotomous subtype membership, ignoring the truth that disease heterogeneity spatially and temporally extends along a continuum. To address the aforementioned limitations, herein, we propose a novel method, termed Surreal-GAN (Semi-SUpeRvised ReprEsentAtion Learning via GAN). Using cross-sectional imaging data, Surreal-GAN dissects underlying disease-related heterogeneity under the principle of semi-supervised clustering (cluster mappings from normal control to patient), proposes a continuously dimensional representation, and infers the disease severity of patients at the individual level along each dimension. The model first learns a transformation function from the normal control (CN) domain to the patient (PT) domain with latent variables controlling transformation directions. An inverse mapping function together with regularization on function continuity, pattern orthogonality and monotonicity is also imposed to make sure that the transformation function captures meaningful imaging patterns with clinical significance. We first validate the model through extensive semi-synthetic experiments, and then demonstrate its potential in capturing biologically plausible imaging patterns in Alzheimer's disease (AD).  ( 2 min )
    Affective Medical Estimation and Decision Making via Visualized Learning and Deep Learning. (arXiv:2205.04599v1 [cs.LG])
    Sophisticated machine learning (ML) techniques have yielded promising results, especially in medical applications, where they have been investigated for different tasks to enhance the decision-making process. Since visualization is such an effective tool for human comprehension, memorization, and judgment, we present a first-of-its-kind estimation approach we refer to as Visualized Learning for Machine Learning (VL4ML) that not only can serve to assist physicians and clinicians in making reasoned medical decisions, but also allows the uncertainty to be appreciated visually, which could reveal incertitude in making the appropriate classification or prediction. For the proof of concept, and to demonstrate the generalized nature of this visualized estimation approach, five different case studies are examined for different types of tasks including classification, regression, and longitudinal prediction. A survey analysis with more than 100 individuals is also conducted to assess users' feedback on this visualized estimation method. The experiments and the survey demonstrate the practical merits of the VL4ML that include: (1) appreciating visually clinical/medical estimations; (2) getting closer to the patients' preferences; (3) improving doctor-patient communication, and (4) visualizing the uncertainty introduced through the black box effect of the deployed ML algorithm. All the source codes are shared via a GitHub repository.  ( 2 min )
    Statistical Guarantees for Approximate Stationary Points of Simple Neural Networks. (arXiv:2205.04491v1 [cs.LG])
    Since statistical guarantees for neural networks are usually restricted to global optima of intricate objective functions, it is not clear whether these theories really explain the performances of actual outputs of neural-network pipelines. The goal of this paper is, therefore, to bring statistical theory closer to practice. We develop statistical guarantees for simple neural networks that coincide up to logarithmic factors with the global optima but apply to stationary points and the points nearby. These results support the common notion that neural networks do not necessarily need to be optimized globally from a mathematical perspective. More generally, despite being limited to simple neural networks for now, our theories make a step forward in describing the practical properties of neural networks in mathematical terms.  ( 2 min )
    Differentiable Electron Microscopy Simulation: Methods and Applications for Visualization. (arXiv:2205.04464v1 [q-bio.QM])
    We propose a new microscopy simulation system that can depict atomistic models in a micrograph visual style, similar to results of physical electron microscopy imaging. This system is scalable, able to represent simulation of electron microscopy of tens of viral particles, and synthesizes the image faster than previous methods. On top of that, the simulator is differentiable, in both its deterministic and stochastic stages that form signal and noise representations in the micrograph. This notable property has the capability for solving inverse problems by means of optimization and thus allows for generation of microscopy simulations using the parameter settings estimated from real data. We demonstrate this learning capability through two applications: (1) estimating the parameters of the modulation transfer function defining the detector properties of the simulated and real micrographs, and (2) denoising the real data based on parameters trained from the simulated examples. While current simulators do not support any parameter estimation due to their forward design, we show that the results obtained using estimated parameters are very similar to the results of real micrographs. Additionally, we evaluate the denoising capabilities of our approach and show that the results improve over state-of-the-art methods. Denoised micrographs exhibit less noise in the tilt-series tomography reconstructions, ultimately reducing the visual dominance of noise in direct volume rendering of microscopy tomograms.  ( 2 min )
    Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures. (arXiv:2205.04713v1 [cs.LG])
    With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent machine learning efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, now we may easily find or even generate a series of models with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a framework for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean automatically selects the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructures. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving costs.  ( 2 min )
    DNS based In-Browser Cryptojacking Detection. (arXiv:2205.04685v1 [cs.CR])
    The metadata aspect of Domain Names (DNs) enables us to perform a behavioral study of DNs and detect if a DN is involved in in-browser cryptojacking. Thus, we are motivated to study different temporal and behavioral aspects of DNs involved in cryptojacking. We use temporal features such as query frequency and query burst along with graph-based features such as degree and diameter, and non-temporal features such as string-based ones to detect if a DN is suspected to be involved in in-browser cryptojacking. Then, we use them to train Machine Learning (ML) algorithms over different temporal granularities such as 2-hour datasets and the complete dataset. Our results show that the DecisionTree classifier performs the best with 59.5% Recall on cryptojacked DNs, while for unsupervised learning, K-Means with K=2 performs the best. Similarity analysis of the features reveals a minimal divergence between the cryptojacking DNs and other already known malicious DNs. It also reveals the need for improvements in the feature set of state-of-the-art methods to improve their accuracy in detecting in-browser cryptojacking. As an added analysis, our signature-based analysis identifies that none of the Indian Government websites were involved in cryptojacking during October-December 2021. However, based on the resource utilization, we identify 10 DNs with different properties than others.  ( 2 min )
    Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random. (arXiv:2205.04701v1 [cs.LG])
    In recommender systems, users always choose favorite items to rate, which results in data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Currently, the doubly robust (DR) method and its variants have been widely studied and demonstrate superior performance. However, we show that DR methods are unstable to extremely small propensities and rely on extrapolations, resulting in sub-optimal performances. In this paper, we propose a stabilized doubly robust (SDR) estimator to address the above limitations while retaining double robustness. Theoretical analysis shows that SDR has bounded bias, variance and generalization error bound under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for SDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approach significantly outperforms the existing methods.  ( 2 min )
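    For context, the plain doubly robust estimator that SDR stabilizes can be written in a few lines; the synthetic errors and propensities below are illustrative, and the paper's stabilization against extreme propensities is intentionally not reproduced here.

        import numpy as np

        def dr_estimate(e_hat, e_obs, observed, p_hat):
            """Doubly robust estimate of the average prediction error over all
            user-item pairs: imputed errors e_hat everywhere, corrected by
            inverse-propensity-weighted residuals on the observed entries."""
            correction = observed * (e_obs - e_hat) / p_hat
            return np.mean(e_hat + correction)

        rng = np.random.default_rng(0)
        n = 1000
        e_hat = rng.uniform(0.2, 0.8, n)            # imputed errors (model-based)
        e_obs = e_hat + rng.normal(0, 0.1, n)       # true errors where observed
        p_hat = rng.uniform(0.05, 0.9, n)           # estimated propensities
        observed = rng.random(n) < p_hat            # missing-not-at-random mask
        print("DR estimate:", dr_estimate(e_hat, e_obs, observed, p_hat))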
    Improving genetic risk prediction across diverse population by disentangling ancestry representations. (arXiv:2205.04673v1 [cs.LG])
    Risk prediction models using genetic data have seen increasing traction in genomics. However, most of the polygenic risk models were developed using data from participants with similar (mostly European) ancestry. This can lead to biases in the risk predictors resulting in poor generalization when applied to minority populations and admixed individuals such as African Americans. To address this bias, largely due to the prediction models being confounded by the underlying population structure, we propose a novel deep-learning framework that leverages data from diverse populations and disentangles ancestry from the phenotype-relevant information in its representation. The ancestry-disentangled representation can be used to build risk predictors that perform better across minority populations. We applied the proposed method to the analysis of Alzheimer's disease genetics. Compared with standard linear and nonlinear risk prediction methods, the proposed method substantially improves risk prediction in minority populations, particularly for admixed individuals.  ( 2 min )
    Real-time Forecasting of Time Series in Financial Markets Using Sequentially Trained Many-to-one LSTMs. (arXiv:2205.04678v1 [cs.LG])
    Financial markets are highly complex and volatile; thus, learning about such markets for the sake of making predictions is vital for making early alerts about crashes and subsequent recoveries. People have been using learning tools from diverse fields such as financial mathematics and machine learning in the attempt of making trustworthy predictions on such markets. However, the accuracy of such techniques had not been adequate until artificial neural network (ANN) frameworks were developed. Moreover, making accurate real-time predictions of financial time series depends heavily on the ANN architecture in use and the procedure for training it. Long short-term memory (LSTM) is a member of the recurrent neural network family which has been widely utilized for time series predictions. Specifically, we train two LSTMs with a known length, say $T$ time steps, of previous data and predict only one time step ahead. At each iteration, while one LSTM is employed to find the best number of epochs, the second LSTM is trained only for the best number of epochs to make predictions. We treat the current prediction as part of the training set for the next prediction and train the same LSTM. While classic ways of training result in more error when predictions are made further into the test period, our approach is capable of maintaining superior accuracy throughout the testing period as training continues. The forecasting accuracy of our approach is validated using three time series from each of three diverse financial markets: stock, cryptocurrency, and commodity. The results are compared with those of an extended Kalman filter, an autoregressive model, and an autoregressive integrated moving average model.  ( 2 min )
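    A compact sketch of the rolling scheme (predict one step, fold the prediction back into the training window, retrain) using a tiny Keras LSTM; the dual-LSTM epoch search is omitted, and the network size, epoch count, and synthetic series are placeholder choices.

        import numpy as np
        import tensorflow as tf

        T = 20                                       # lookback window length
        rng = np.random.default_rng(0)
        series = np.sin(np.linspace(0, 20, 300)) + 0.05 * rng.normal(size=300)

        def windows(x, T):
            X = np.stack([x[i:i + T] for i in range(len(x) - T)])
            return X[..., None], x[T:]

        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(16, input_shape=(T, 1)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

        split = 250
        history = series[:split].copy()
        preds = []
        for t in range(split, 300):
            X, y = windows(history, T)
            model.fit(X, y, epochs=2, verbose=0)     # brief re-training each step
            x_last = history[-T:][None, :, None]
            preds.append(float(model.predict(x_last, verbose=0)[0, 0]))
            history = np.append(history, preds[-1])  # roll the prediction into the training data
        print("RMSE:", np.sqrt(np.mean((np.array(preds) - series[split:]) ** 2)))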
    SuMe: A Dataset Towards Summarizing Biomedical Mechanisms. (arXiv:2205.04652v1 [cs.CL])
    Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 22k instances. We also introduce conclusion sentence generation as a pretraining task with 611k instances. We benchmark the performance of large bio-domain language models. We find that while the pretraining task helps improve performance, the best model produces acceptable mechanism outputs in only 32% of the instances, which shows the task presents significant challenges in biomedical language understanding and summarization.  ( 2 min )
    On some studies of Fraud Detection Pipeline and related issues from the scope of Ensemble Learning and Graph-based Learning. (arXiv:2205.04626v1 [cs.LG])
    The UK anti-fraud charity Fraud Advisory Panel (FAP) in their review of 2016 estimates business costs of fraud at 144 billion, and its individual counterpart at 9.7 billion. Banking, insurance, manufacturing, and government are the most common industries affected by fraud activities. Designing an efficient fraud detection system could avoid losing this money; however, building this system is challenging due to many difficult problems, e.g. imbalanced data, computing costs, etc. Over the last three decades, there has been various research related to fraud detection, but no agreement on what is the best approach to building a fraud detection system. In this thesis, we aim to answer some questions such as i) how to build a simplified and effective Fraud Detection System that is not only easy to implement but also provides reliable results, with our proposed Fraud Detection Pipeline as a potential backbone of the system that is easy to extend or upgrade; ii) when to update models in our system (while keeping the accuracy stable) in order to reduce the cost of the updating process; iii) how to deal with an extreme imbalance in big data classification problems, e.g. fraud detection, since this is the gap between two difficult problems; iv) further, how to apply graph-based semi-supervised learning to detect fraudulent transactions.  ( 2 min )
    Risk Aversion In Learning Algorithms and an Application To Recommendation Systems. (arXiv:2205.04619v1 [cs.LG])
    Consider a bandit learning environment. We demonstrate that popular learning algorithms such as Upper Confidence Bound (UCB) and $\varepsilon$-Greedy exhibit risk aversion: when presented with two arms of the same expectation, but different variance, the algorithms tend to not choose the riskier, i.e. higher variance, arm. We prove that $\varepsilon$-Greedy chooses the risky arm with probability tending to $0$ when faced with a deterministic and a Rademacher-distributed arm. We show experimentally that UCB also shows risk-averse behavior, and that risk aversion is present persistently in early rounds of learning even if the riskier arm has a slightly higher expectation. We calibrate our model to a recommendation system and show that algorithmic risk aversion can decrease consumer surplus and increase homogeneity. We discuss several extensions to other bandit algorithms, reinforcement learning, and investigate the impacts of algorithmic risk aversion for decision theory.  ( 2 min )
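    The deterministic-vs-Rademacher experiment is easy to reproduce in simulation; the horizon, epsilon, and tie-breaking below are arbitrary illustrative choices.

        import numpy as np

        def eps_greedy_risky_share(horizon=10000, eps=0.1, seed=0):
            """Two arms with equal mean 0: arm 0 pays 0 deterministically,
            arm 1 pays +/-1 (Rademacher). Return the share of risky pulls."""
            rng = np.random.default_rng(seed)
            counts, sums = np.zeros(2), np.zeros(2)
            risky_pulls = 0
            for t in range(horizon):
                if t < 2:
                    arm = t                              # pull each arm once
                elif rng.random() < eps:
                    arm = int(rng.integers(2))           # explore
                else:
                    arm = int(np.argmax(sums / counts))  # exploit empirical means
                reward = 0.0 if arm == 0 else rng.choice([-1.0, 1.0])
                counts[arm] += 1
                sums[arm] += reward
                risky_pulls += (arm == 1)
            return risky_pulls / horizon

        shares = [eps_greedy_risky_share(seed=s) for s in range(10)]
        print(f"mean share of risky pulls: {np.mean(shares):.3f}")  # well below 0.5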
    Image2Gif: Generating Continuous Realistic Animations with Warping NODEs. (arXiv:2205.04519v1 [cs.CV])
    Generating smooth animations from a limited number of sequential observations has a number of applications in vision. For example, it can be used to increase the number of frames per second, or to generate a new trajectory based only on the first and last frames, e.g. the motion of facial emotions. Despite the discrete observed data (frames), the problem of generating a new trajectory is a continuous one. In addition, to be perceptually realistic, the domain of an image should not alter drastically along the trajectory of changes. In this paper, we propose a new framework, Warping Neural ODE, for generating a smooth animation (video frame interpolation) in a continuous manner, given two ("farther apart") frames, denoting the start and the end of the animation. The key feature of our framework is utilizing the continuous spatial transformation of the image based on the vector field, derived from a system of differential equations. This allows us to achieve the smoothness and the realism of an animation with infinitely small time steps between the frames. We show the application of our work in generating an animation given two frames, in different training settings, including Generative Adversarial Network (GAN) and with $L_2$ loss.  ( 2 min )
    How Does Frequency Bias Affect the Robustness of Neural Image Classifiers against Common Corruption and Adversarial Perturbations?. (arXiv:2205.04533v1 [cs.LG])
    Model robustness is vital for the reliable deployment of machine learning models in real-world applications. Recent studies have shown that data augmentation can result in models over-relying on features in the low-frequency domain, sacrificing performance against low-frequency corruptions, highlighting a connection between frequency and robustness. Here, we take one step further to more directly study the frequency bias of a model through the lens of its Jacobians and its implications for model robustness. To achieve this, we propose Jacobian frequency regularization for models' Jacobians to have a larger ratio of low-frequency components. Through experiments on four image datasets, we show that biasing classifiers towards low (high)-frequency components can bring performance gain against high (low)-frequency corruption and adversarial perturbation, albeit with a tradeoff in performance for low (high)-frequency corruption. Our approach elucidates a more direct connection between the frequency bias and robustness of deep learning models.  ( 2 min )
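    The frequency lens can be probed cheaply: take a model's input gradient (one row of its Jacobian), transform it with a 2-D FFT, and measure how much energy sits at low frequencies. The finite-difference toy below is a sketch of that measurement, not the paper's regularizer.

        import numpy as np

        def input_gradient(f, x, h=1e-4):
            """Finite-difference gradient of a scalar model output w.r.t. a flat image."""
            g = np.zeros_like(x)
            fx = f(x)
            for i in range(x.size):
                xp = x.copy()
                xp[i] += h
                g[i] = (f(xp) - fx) / h
            return g

        def low_freq_ratio(grad_img, cutoff=4):
            """Share of the gradient's spectral energy inside a low-frequency box."""
            F = np.fft.fftshift(np.fft.fft2(grad_img))
            c = np.array(F.shape) // 2
            low = F[c[0] - cutoff:c[0] + cutoff, c[1] - cutoff:c[1] + cutoff]
            return np.sum(np.abs(low) ** 2) / np.sum(np.abs(F) ** 2)

        H = W = 16
        rng = np.random.default_rng(0)
        w = rng.normal(size=H * W)
        f = lambda x: np.tanh(x @ w)                 # toy scalar "classifier logit"
        x = rng.normal(size=H * W)
        g = input_gradient(f, x).reshape(H, W)
        print(f"low-frequency energy ratio: {low_freq_ratio(g):.3f}")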
    Nightly Automobile Claims Prediction from Telematics-Derived Features: A Multilevel Approach. (arXiv:2205.04616v1 [cs.LG])
    In recent years it has become possible to collect GPS data from drivers and to incorporate this data into automobile insurance pricing. This data is collected continuously and processed nightly into metadata consisting of mileage and time summaries of each discrete trip taken, together with a set of behavioral scores describing attributes of the trip (e.g., driver fatigue or driver distraction). We examine whether this metadata can be used to identify periods of increased risk by classifying trips that occur immediately before a trip in which there was an incident leading to a claim for that driver. Identification of periods of increased risk for a driver is valuable because it creates an opportunity for intervention and, potentially, avoidance of a claim. We examine metadata for each trip a driver takes and train a classifier to predict whether \textit{the following trip} is one in which a claim occurs for that driver. By achieving an area under the receiver operating characteristic curve (AUC) above 0.6, we show that it is possible to predict claims in advance. Additionally, we compare the predictive power, as measured by AUC, of XGBoost classifiers trained to predict whether a driver will have a claim using exposure features, such as driven miles, against those trained using behavioral features, such as a computed speed score.  ( 2 min )
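    The feature-set comparison is easy to reproduce in outline. Below is our sketch with synthetic placeholder data (the real features and labels are proprietary); sklearn's GradientBoostingClassifier stands in for XGBoost to avoid an extra dependency.

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        def auc_for(X, y):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.3, stratify=y, random_state=0)
            clf = GradientBoostingClassifier().fit(X_tr, y_tr)  # stand-in for XGBoost
            return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

        # Synthetic placeholder: trips x features; columns 0-1 "exposure", 2-4 "behavioral".
        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 5))
        y = (0.8 * X[:, 3] + 0.3 * X[:, 0] + rng.normal(size=5000) > 1.5).astype(int)
        print('exposure AUC:  ', auc_for(X[:, :2], y))
        print('behavioral AUC:', auc_for(X[:, 2:], y))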
    A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing. (arXiv:2205.04559v1 [cs.CL])
    There has been significant debate in the NLP community about whether or not attention weights can be used as an explanation - a mechanism for interpreting how important each input token is for a particular prediction. The validity of "attention as explanation" has so far been evaluated by computing the rank correlation between attention-based explanations and existing feature attribution explanations using LSTM-based models. In our work, we (i) compare the rank correlation between five more recent feature attribution methods and two attention-based methods, on two types of NLP tasks, and (ii) extend this analysis to also include transformer-based models. We find that attention-based explanations do not correlate strongly with any recent feature attribution methods, regardless of the model or task. Furthermore, we find that none of the tested explanations correlate strongly with one another for the transformer-based model, leading us to question the underlying assumption that we should measure the validity of attention-based explanations based on how well they correlate with existing feature attribution explanation methods. After conducting experiments on five datasets using two different models, we argue that the community should stop using rank correlation as an evaluation metric for attention-based explanations. We suggest that researchers and practitioners should instead test various explanation methods and employ a human-in-the-loop process to determine if the explanations align with human intuition for the particular use case at hand.  ( 2 min )
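    The agreement measure at the heart of this debate is just a pairwise rank correlation over per-token attribution scores. A minimal sketch (random vectors as placeholders for real attributions):

        import numpy as np
        from scipy.stats import spearmanr

        def explanation_agreement(attributions):
            """Pairwise Spearman rank correlation between attribution vectors.

            `attributions` maps method name -> 1-D score array for the same input."""
            names = list(attributions)
            for i, a in enumerate(names):
                for b in names[i + 1:]:
                    rho, _ = spearmanr(attributions[a], attributions[b])
                    print(f'{a} vs {b}: rho = {rho:.3f}')

        rng = np.random.default_rng(0)
        explanation_agreement({
            'attention':            rng.random(24),
            'integrated_gradients': rng.random(24),
            'lime':                 rng.random(24),
        })

    The paper's point is precisely that low values of this statistic between methods do not, by themselves, tell us which explanation (if any) is valid.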
    Sentence-level Privacy for Document Embeddings. (arXiv:2205.04605v1 [cs.LG])
    User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work, we propose SentDP: pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high-dimensional, general-purpose $\epsilon$-SentDP document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable. Our experiments indicate that these private document embeddings are useful for downstream tasks like sentiment analysis and topic classification and even outperform baseline methods with weaker guarantees like word-level Metric DP.  ( 2 min )
    Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances. (arXiv:2205.04586v1 [cs.LG])
    Calculating the most efficient schedule of work in a neural network compiler is a difficult task. There are many parameters to be accounted for that can positively or adversely affect that schedule depending on their configuration - how work is shared between distributed targets, the subdivision of tensors to fit in memory, whether optimizations are enabled, etc. Traditionally, neural network compilers determine how to set these values by building a graph of choices and choosing the path with minimal 'cost'. These choices and their corresponding costs are usually determined by an algorithm crafted by engineers with deep knowledge of the target platform. However, when the number of options available to a compiler is large, it is very difficult to ensure that these models consistently produce an optimal schedule for all scenarios, whilst still completing compilation in an acceptable timeframe. This paper presents 'VPUNN' - a neural network-based cost model trained on low-level task profiling that consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.  ( 2 min )
    KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction. (arXiv:2205.04624v1 [cs.CV])
    Predicting future trajectories of road agents is a critical task for autonomous driving. Recent goal-based trajectory prediction methods, such as DenseTNT and PECNet, have shown good performance on prediction tasks on public datasets. However, they usually require complicated goal-selection algorithms and optimization. In this work, we propose KEMP, a hierarchical end-to-end deep learning framework for trajectory prediction. At the core of our framework is keyframe-based trajectory prediction, where keyframes are representative states that trace out the general direction of the trajectory. KEMP first predicts keyframes conditioned on the road context, and then fills in intermediate states conditioned on the keyframes and the road context. Under our general framework, goal-conditioned methods are special cases in which the number of keyframes equals one. Unlike goal-conditioned methods, our keyframe predictor is learned automatically and does not require hand-crafted goal-selection algorithms. We evaluate our model on public benchmarks and our model ranked 1st on the Waymo Open Motion Dataset Leaderboard (as of September 1, 2021).  ( 2 min )
    Calibrating for Class Weights by Modeling Machine Learning. (arXiv:2205.04613v1 [cs.LG])
    A much-studied issue is the extent to which the confidence scores provided by machine learning algorithms are calibrated to ground-truth probabilities. Our starting point is that calibration is seemingly incompatible with class weighting, a technique often employed when one class is less common (class imbalance) or with the hope of achieving some external objective (cost-sensitive learning). We provide a model-based explanation for this incompatibility and use our anthropomorphic model to generate a simple method of recovering likelihoods from an algorithm that is miscalibrated due to class weighting. We validate this approach on the binary pneumonia detection task of Rajpurkar, Irvin, Zhu, et al. (2017).  ( 2 min )
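    For concreteness, here is one standard prior-correction formula consistent with the paper's premise (the paper's own model-based method may differ): if a classifier was trained with class weights $(w_0, w_1)$, its score approximates $s = w_1 p / (w_1 p + w_0 (1-p))$, which can be inverted to recover the likelihood $p$.

        def deweight(score, w0, w1):
            """Invert the shift induced by class weights (w0, w1).

            Assumes the weighted model's score s relates to the true likelihood p via
            s = w1*p / (w1*p + w0*(1-p)); solving for p recovers the calibrated value."""
            num = score / w1
            return num / (num + (1.0 - score) / w0)

        # A model trained with a 10x weight on the positive class that outputs 0.5
        # corresponds to a true likelihood of roughly 0.09:
        print(deweight(0.5, w0=1.0, w1=10.0))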
    Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation. (arXiv:2205.04610v1 [cs.LG])
    Research in machine learning fairness has historically considered a single binary demographic attribute; however, the reality is of course far more complicated. In this work, we grapple with questions that arise along three stages of the machine learning pipeline when incorporating intersectionality as multiple demographic attributes: (1) which demographic attributes to include as dataset labels, (2) how to handle the progressively smaller size of subgroups during model training, and (3) how to move beyond existing evaluation metrics when benchmarking model fairness for more subgroups. For each question, we provide thorough empirical evaluation on tabular datasets derived from the US Census, and present constructive recommendations for the machine learning community. First, we advocate for supplementing domain knowledge with empirical validation when choosing which demographic attribute labels to train on, while always evaluating on the full set of demographic attributes. Second, we warn against using data imbalance techniques without considering their normative implications and suggest an alternative using the structure in the data. Third, we introduce new evaluation metrics which are more appropriate for the intersectional setting. Overall, we provide substantive suggestions on three necessary (albeit not sufficient!) considerations when incorporating intersectionality into machine learning.  ( 2 min )
    Rethinking Fairness: An Interdisciplinary Survey of Critiques of Hegemonic ML Fairness Approaches. (arXiv:2205.04460v1 [cs.LG])
    This survey article assesses and compares existing critiques of current fairness-enhancing technical interventions into machine learning (ML) that draw from a range of non-computing disciplines, including philosophy, feminist studies, critical race and ethnic studies, legal studies, anthropology, and science and technology studies. It bridges epistemic divides in order to offer an interdisciplinary understanding of the possibilities and limits of hegemonic computational approaches to ML fairness for producing just outcomes for society's most marginalized. The article is organized according to nine major themes of critique wherein these different fields intersect: 1) how "fairness" in AI fairness research gets defined; 2) how problems for AI systems to address get formulated; 3) the impacts of abstraction on how AI tools function and its propensity to lead to technological solutionism; 4) how racial classification operates within AI fairness research; 5) the use of AI fairness measures to avoid regulation and engage in ethics washing; 6) an absence of participatory design and democratic deliberation in AI fairness considerations; 7) data collection practices that entrench "bias," are non-consensual, and lack transparency; 8) the predatory inclusion of marginalized groups into AI systems; and 9) a lack of engagement with AI's long-term social and ethical outcomes. Drawing from these critiques, the article concludes by imagining future ML fairness research directions that actively disrupt entrenched power dynamics and structural injustices in society.  ( 2 min )
    A Probabilistic Generative Model of Free Categories. (arXiv:2205.04545v1 [cs.AI])
    Applied category theory has recently developed libraries for computing with morphisms in interesting categories, while machine learning has developed ways of learning programs in interesting languages. Taking the analogy between categories and languages seriously, this paper defines a probabilistic generative model of morphisms in free monoidal categories over domain-specific generating objects and morphisms. The paper shows how acyclic directed wiring diagrams can model specifications for morphisms, which the model can use to generate morphisms. Amortized variational inference in the generative model then enables learning of parameters (by maximum likelihood) and inference of latent variables (by Bayesian inversion). A concrete experiment shows that the free category prior achieves competitive reconstruction performance on the Omniglot dataset.  ( 2 min )
    Towards a multi-stakeholder value-based assessment framework for algorithmic systems. (arXiv:2205.04525v1 [cs.LG])
    In an effort to regulate Machine Learning-driven (ML) systems, current auditing processes mostly focus on detecting harmful algorithmic biases. While these strategies have proven to be impactful, some values outlined in documents dealing with ethics in ML-driven systems are still underrepresented in auditing processes. Such unaddressed values mainly deal with contextual factors that cannot be easily quantified. In this paper, we develop a value-based assessment framework that is not limited to bias auditing and that covers prominent ethical principles for algorithmic systems. Our framework presents a circular arrangement of values with two bipolar dimensions that make common motivations and potential tensions explicit. In order to operationalize these high-level principles, values are then broken down into specific criteria and their manifestations. However, some of these value-specific criteria are mutually exclusive and require negotiation. As opposed to some other auditing frameworks that merely rely on ML researchers' and practitioners' input, we argue that it is necessary to include stakeholders that present diverse standpoints to systematically negotiate and consolidate value and criteria tensions. To that end, we map stakeholders with different insight needs, and assign tailored means for communicating value manifestations to them. We, therefore, contribute to current ML auditing practices with an assessment framework that visualizes closeness and tensions between values and we give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.  ( 2 min )
    Selectively Contextual Bandits. (arXiv:2205.04528v1 [cs.LG])
    Contextual bandits are widely used in industrial personalization systems. These online learning frameworks learn a treatment assignment policy in the presence of treatment effects that vary with the observed contextual features of the users. While personalization creates a rich user experience that reflects individual interests, there are benefits of a shared experience across a community that enable participation in the zeitgeist. Such benefits are emergent through network effects and are not captured in the regret metrics typically employed in evaluating bandits. To balance these needs, we propose a new online learning algorithm that preserves the benefits of personalization while increasing the commonality in treatments across users. Our approach selectively interpolates between a contextual bandit algorithm and a context-free multi-armed bandit, and leverages the contextual information for a treatment decision only if it promises significant gains. Apart from helping users of personalization systems balance their experience between the individualized and the shared, simplifying the treatment assignment policy by making it selectively reliant on the context can help improve the rate of learning in some cases. We evaluate our approach in a classification setting using public datasets and show the benefits of the hybrid policy.  ( 2 min )
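    A stripped-down reading of the selective rule (our sketch; the paper's gating criterion and estimators may differ): act on the context only when the contextual model's predicted gain over the context-free favorite clears a margin.

        import numpy as np

        def select_action(ctx_pred, global_means, margin=0.05):
            """Use the contextual arm only if its predicted gain over the
            context-free favorite exceeds `margin`; otherwise share the treatment.

            ctx_pred:     per-arm reward predictions for this user's context
            global_means: context-free empirical mean reward per arm"""
            ctx_arm = int(np.argmax(ctx_pred))
            shared_arm = int(np.argmax(global_means))
            gain = ctx_pred[ctx_arm] - ctx_pred[shared_arm]
            return ctx_arm if gain > margin else shared_arm

        # Gain of 0.03 is below the margin, so the shared arm (0) is chosen:
        print(select_action(np.array([0.30, 0.33]), np.array([0.31, 0.29])))

    The margin is the knob that trades personalization against commonality: at zero it reduces to the contextual policy, and at infinity to the context-free bandit.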
    PinnerFormer: Sequence Modeling for User Representation at Pinterest. (arXiv:2205.04507v1 [cs.LG])
    Sequential models have become increasingly popular in powering personalized recommendation systems over the past several years. These approaches traditionally model a user's actions on a website as a sequence to predict the user's next action. While conceptually simple, these models are quite challenging to deploy in production, commonly requiring streaming infrastructure to reflect the latest user activity and potentially managing mutable data for encoding a user's hidden state. Here we introduce PinnerFormer, a user representation trained to predict a user's future long-term engagement using a sequential model of the user's recent actions. Unlike prior approaches, we adapt our modeling to a batch infrastructure via our new dense all-action loss, modeling long-term future actions instead of next-action prediction. We show that by doing so, we significantly close the gap between batch user embeddings that are generated once a day and real-time user embeddings generated whenever a user takes an action. We describe our design decisions via extensive offline experimentation and ablations, and validate the efficacy of our approach in A/B experiments showing substantial improvements in Pinterest's user retention and engagement when comparing PinnerFormer against our previous user representation. PinnerFormer has been deployed in production since Fall 2021.  ( 2 min )
    Insights into the origin of halo mass profiles from machine learning. (arXiv:2205.04474v1 [astro-ph.CO])
    The mass distribution of dark matter haloes is the result of the hierarchical growth of initial density perturbations through mass accretion and mergers. We use an interpretable machine-learning framework to provide physical insights into the origin of the spherically-averaged mass profile of dark matter haloes. We train a gradient-boosted-trees algorithm to predict the final mass profiles of cluster-sized haloes, and measure the importance of the different inputs provided to the algorithm. We find two primary scales in the initial conditions (ICs) that impact the final mass profile: the density at approximately the scale of the haloes' Lagrangian patch $R_L$ ($R\sim 0.7\, R_L$) and that in the large-scale environment ($R\sim 1.7~R_L$). The model also identifies three primary time-scales in the halo assembly history that affect the final profile: (i) the formation time of the virialized, collapsed material inside the halo, (ii) the dynamical time, which captures the dynamically unrelaxed, infalling component of the halo over its first orbit, (iii) a third, most recent time-scale, which captures the impact on the outer profile of recent massive merger events. While the inner profile retains memory of the ICs, this information alone is insufficient to yield accurate predictions for the outer profile. As we add information about the haloes' mass accretion history, we find a significant improvement in the predicted profiles at all radii. Our machine-learning framework provides novel insights into the role of the ICs and the mass assembly history in determining the final mass profile of cluster-sized haloes.  ( 2 min )
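    The paper's pipeline is far richer than this, but the basic pattern — fit gradient-boosted trees to map IC features to profile masses, then rank the inputs by importance — is simple to sketch. Everything below is a synthetic stand-in: the "density at scale $s\,R_L$" features and the target are fabricated for illustration.

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.inspection import permutation_importance

        # Synthetic stand-in: IC densities smoothed at several scales (units of R_L)
        # as inputs, final profile mass in one radial bin as the target.
        rng = np.random.default_rng(0)
        scales = [0.3, 0.7, 1.2, 1.7, 2.5]
        X = rng.normal(size=(2000, len(scales)))
        y = 1.5 * X[:, 1] + 0.6 * X[:, 3] + 0.1 * rng.normal(size=2000)  # 0.7 and 1.7 R_L dominate

        model = GradientBoostingRegressor().fit(X, y)
        imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
        for s, m in sorted(zip(scales, imp.importances_mean), key=lambda p: -p[1]):
            print(f'density at {s:.1f} R_L: importance {m:.3f}')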
    OTFPF: Optimal Transport-Based Feature Pyramid Fusion Network for Brain Age Estimation with 3D Overlapped ConvNeXt. (arXiv:2205.04684v1 [cs.CV])
    The chronological age of a healthy brain can be predicted from T1-weighted magnetic resonance images (T1 MRIs) using deep neural networks, and the predicted brain age can serve as an effective biomarker for detecting aging-related diseases or disorders. In this paper, we propose an end-to-end neural network architecture, referred to as the optimal transport based feature pyramid fusion (OTFPF) network, for brain age estimation with T1 MRIs. The OTFPF consists of three types of modules: an Optimal Transport based Feature Pyramid Fusion (OTFPF) module, a 3D overlapped ConvNeXt (3D OL-ConvNeXt) module and a fusion module. These modules strengthen the OTFPF network's understanding of each brain's semi-multimodal and multi-level feature pyramid information, and significantly improve its estimation performance. Compared with recent state-of-the-art models, the proposed OTFPF converges faster and performs better. Experiments with 11,728 MRIs of subjects aged 3-97 years show that the OTFPF network provides accurate brain age estimation, yielding a mean absolute error (MAE) of 2.097, a Pearson's correlation coefficient (PCC) of 0.993 and a Spearman's rank correlation coefficient (SRCC) of 0.989 between the estimated and chronological ages. Extensive quantitative and ablation experiments demonstrate the superiority and rationality of the OTFPF network. The code and implementation details will be released on GitHub: https://github.com/ZJU-Brain/OTFPF after the final decision.  ( 2 min )
    Variational Inference MPC using Normalizing Flows and Out-of-Distribution Projection. (arXiv:2205.04667v1 [cs.RO])
    We propose a Model Predictive Control (MPC) method for collision-free navigation that uses amortized variational inference to approximate the distribution of optimal control sequences by training a normalizing flow conditioned on the start, goal and environment. This representation allows us to learn a distribution that accounts for both the dynamics of the robot and complex obstacle geometries. We can then sample from this distribution to produce control sequences which are likely to be both goal-directed and collision-free as part of our proposed FlowMPPI sampling-based MPC method. However, when deploying this method, the robot may encounter an out-of-distribution (OOD) environment, i.e. one which is radically different from those used in training. In such cases, the learned flow cannot be trusted to produce low-cost control sequences. To generalize our method to OOD environments we also present an approach that performs projection on the representation of the environment as part of the MPC process. This projection changes the environment representation to be more in-distribution while also optimizing trajectory quality in the true environment. Our simulation results on a 2D double-integrator and a 3D 12DoF underactuated quadrotor suggest that FlowMPPI with projection outperforms state-of-the-art MPC baselines on both in-distribution and OOD environments, including OOD environments generated from real-world data.  ( 2 min )
    Crypto Pump and Dump via Deep Learning Techniques. (arXiv:2205.04646v1 [cs.LG])
    Despite the fact that cryptocurrencies themselves have experienced an astonishing rate of adoption over the last decade, cryptocurrency fraud detection is a heavily under-researched problem area. Of all fraudulent activity regarding cryptocurrencies, pump and dump schemes are some of the most common. Though some studies have been done on these kinds of scams in the stock market, the lack of labelled stock data and the volatility unique to the cryptocurrency space constrain the applicability of studies on the stock market toward this problem domain. Furthermore, the only work done in this space thus far has been either statistical in nature, or has been concerned with classical machine learning models such as random forests. We propose the novel application of two existing neural network architectures to this problem domain and show that deep learning solutions can significantly outperform all other existing pump and dump detection methods for cryptocurrencies.  ( 2 min )
    A 14uJ/Decision Keyword Spotting Accelerator with In-SRAM-Computing and On Chip Learning for Customization. (arXiv:2205.04665v1 [cs.AR])
    Keyword spotting has gained popularity as a natural way to interact with consumer devices in recent years. However, because of its always-on nature and the variety of speech, it necessitates a low-power design as well as user customization. This paper describes a low-power, energy-efficient keyword spotting accelerator with SRAM-based in-memory computing (IMC) and on-chip learning for user customization. However, IMC is constrained by macro size, limited precision, and non-ideal effects. To address the issues mentioned above, this paper proposes bias compensation and fine-tuning using an IMC-aware model design. Furthermore, because learning with low-precision edge devices results in zero error and gradient values due to quantization, this paper proposes error scaling and small gradient accumulation to achieve the same accuracy as ideal model training. The simulation results show that with user customization, we can recover the accuracy from 51.08\% to 89.76\% with compensation and fine-tuning, and further improve it to 96.71\% with customization. The chip implementation can successfully run the model with only 14 $\mu$J per decision. When compared to the state-of-the-art works, the presented design has higher energy efficiency with additional on-chip model customization capabilities for higher accuracy.  ( 2 min )
    An Edge-Cloud Integrated Framework for Flexible and Dynamic Stream Analytics. (arXiv:2205.04622v1 [cs.DC])
    With the popularity of Internet of Things (IoT), edge computing and cloud computing, more and more stream analytics applications are being developed, including real-time trend prediction and object detection on top of IoT sensing data. One popular type of stream analytics is recurrent neural network (RNN) based time series or sequence data prediction and forecasting. Different from traditional analytics, which assumes that the data to be processed are available ahead of time and will not change, stream analytics deals with data that are generated continuously and whose trend/distribution could change (aka concept drift), which causes prediction/forecasting accuracy to drop over time. Another challenge is finding the best resource provisioning for stream analytics to achieve good overall latency. In this paper, we study how to best leverage edge and cloud resources to achieve better accuracy and latency for RNN-based stream analytics. We propose a novel edge-cloud integrated framework for hybrid stream analytics that supports low-latency inference on the edge and high-capacity training on the cloud. We study the flexible deployment of our hybrid learning framework, namely edge-centric, cloud-centric and edge-cloud integrated. Further, our hybrid learning framework can dynamically combine inference results from an RNN model pre-trained on historical data and another RNN model re-trained periodically on the most recent data. Using real-world and simulated stream datasets, our experiments show the proposed edge-cloud deployment is the best among all three deployment types in terms of latency. For accuracy, the experiments show our dynamic learning approach performs the best among all learning approaches for all three concept drift scenarios.  ( 2 min )
    Robust Learning of Parsimonious Deep Neural Networks. (arXiv:2205.04650v1 [cs.LG])
    We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles, learns the posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection in a way that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. Our algorithm is robust in the sense that it achieves consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We provide an analysis of its convergence properties, establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures. The simulations show that our method achieves pruning levels on par with state-of-the-art methods for structured pruning, while maintaining better test accuracy and, more importantly, doing so in a manner robust with respect to network initialization and initial size.  ( 2 min )
    Real-Time Wearable Gait Phase Segmentation For Running And Walking. (arXiv:2205.04668v1 [cs.LG])
    Previous approaches that cast gait phase detection as a convolutional neural network (CNN) based classification task require cumbersome manual setting of time delays or heavily overlapped sliding windows to accurately classify each phase under different test cases, which is not suitable for streaming Inertial-Measurement-Unit (IMU) sensor data and fails to adapt to different scenarios. This paper presents segmentation-based gait phase detection with only a single six-axis IMU sensor, which can easily adapt to both walking and running at various speeds. The proposed segmentation uses a CNN with a gait-phase-aware receptive field setting and an IMU-oriented processing order, which can accommodate IMU sampling rates from 1000 Hz, for high accuracy, down to 20 Hz, for real-time calculation. On the 20 Hz data, the proposed model achieves an average error of 8.86 ms in swing time and 9.12 ms in stance time, 96.44\% accuracy in gait phase detection and 99.97\% accuracy in stride detection. Its real-time implementation on a mobile phone takes only 36 ms per second of sensor data.  ( 2 min )
    Deep Gait Tracking With Inertial Measurement Unit. (arXiv:2205.04666v1 [cs.LG])
    This paper presents convolutional neural network based foot motion tracking with only six-axis Inertial-Measurement-Unit (IMU) sensor data. The presented approach can adapt to various walking conditions by adopting differential and window-based inputs. The training data are further augmented by sliding and random window samplings on the IMU sensor data to increase data diversity for better performance. The proposed approach fuses the predictions for the three output dimensions into one model, which achieves an average error of 2.30±2.23 cm along the X-axis, 0.91±0.95 cm along the Y-axis and 0.58±0.52 cm along the Z-axis.  ( 2 min )

    Flexible variable selection in the presence of missing data. (arXiv:2202.12989v2 [stat.ME] UPDATED)
    In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose several nonparametric variable selection algorithms combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithms that achieve control of commonly used error rates. Through simulations, we show that our proposals have good operating characteristics and result in panels with higher classification performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed methods to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.  ( 2 min )
    Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. (arXiv:2106.02073v4 [cs.LG] UPDATED)
    The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide PyTorch code for reproducing MSE-NC and CE-NC in a Google Colab notebook: https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.  ( 3 min )
    Nested conformal prediction and quantile out-of-bag ensemble methods. (arXiv:1910.10562v4 [stat.ME] UPDATED)
    Conformal prediction is a popular tool for providing valid prediction sets for classification and regression problems, without relying on any distributional assumptions on the data. While the traditional description of conformal prediction starts with a nonconformity score, we provide an alternate (but equivalent) view that starts with a sequence of nested sets and calibrates them to find a valid prediction set. The nested framework subsumes all nonconformity scores, including recent proposals based on quantile regression and density estimation. While these ideas were originally derived based on sample splitting, our framework seamlessly extends them to other aggregation schemes like cross-conformal, jackknife+ and out-of-bag methods. We use the framework to derive a new algorithm (QOOB, pronounced cube) that combines four ideas: quantile regression, cross-conformalization, ensemble methods and out-of-bag predictions. We develop a computationally efficient implementation of cross-conformal, that is also used by QOOB. In a detailed numerical investigation, QOOB performs either the best or close to the best on all simulated and real datasets. Code for QOOB is available at https://github.com/aigen/QOOB.  ( 2 min )
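    The nested-sets view is easiest to see in the split-conformal special case, sketched below: the nested sets are intervals $\{y : |y - \hat\mu(x)| \le q\}$ indexed by $q$, calibrated with a quantile of held-out scores. (This is the textbook baseline the paper generalizes, not QOOB itself.)

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(2000, 1))
        y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=2000)

        X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
        model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

        # Nonconformity scores on the calibration half; nested sets are |y - mu(x)| <= q.
        scores = np.abs(y_cal - model.predict(X_cal))
        alpha = 0.1
        level = np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores)
        q = np.quantile(scores, level)

        x_new = np.array([[0.5]])
        mu = model.predict(x_new)[0]
        print(f'90% prediction interval: [{mu - q:.2f}, {mu + q:.2f}]')

    Swapping the absolute-residual score for a quantile-regression score, and the single split for out-of-bag predictions, is essentially the road from this baseline to QOOB.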
    Adaptation Strategies for Automated Machine Learning on Evolving Data. (arXiv:2006.06480v3 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of data stream challenges such as concept drift on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust. To that end, we propose 6 concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches. We do this for a variety of AutoML approaches for building machine learning pipelines, including those that leverage Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques.  ( 2 min )
    Mean Estimation from One-Bit Measurements. (arXiv:1901.03403v4 [cs.IT] UPDATED)
    We consider the problem of estimating the mean of a symmetric log-concave distribution under the constraint that only a single bit per sample from this distribution is available to the estimator. We study the mean squared error as a function of the sample size (and hence the number of bits). We consider three settings: first, a centralized setting, where an encoder may release $n$ bits given a sample of size $n$, and for which there is no asymptotic penalty for quantization; second, an adaptive setting in which each bit is a function of the current observation and previously recorded bits, where we show that the optimal relative efficiency compared to the sample mean is precisely the efficiency of the median; lastly, we show that in a distributed setting where each bit is only a function of a local sample, no estimator can achieve optimal efficiency uniformly over the parameter space. We additionally complement our results in the adaptive setting by showing that \emph{one} round of adaptivity is sufficient to achieve optimal mean-square error.  ( 2 min )
    A High Throughput Generative Vector Autoregression Model for Stochastic Synapses. (arXiv:2205.05053v1 [cs.NE])
    By imitating the synaptic connectivity and plasticity of the brain, emerging electronic nanodevices offer new opportunities as the building blocks of neuromorphic systems. One challenge for large-scale simulations of computational architectures based on emerging devices is to accurately capture device response, hysteresis, noise, and the covariance structure in the temporal domain as well as between the different device parameters. We address this challenge with a high throughput generative model for synaptic arrays that is based on a recently available type of electrical measurement data for resistive memory cells. We map this real-world data onto a vector autoregressive stochastic process to accurately reproduce the device parameters and their cross-correlation structure. While closely matching the measured data, our model is still very fast; we provide parallelized implementations for both CPUs and GPUs and demonstrate array sizes above one billion cells and throughputs exceeding one hundred million weight updates per second, above the pixel rate of a 30 frames/s 4K video stream.  ( 2 min )
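    The generative core is a vector autoregression. A first-order sketch (the paper's fitted model may use a different order and parameters; the matrices below are invented for illustration):

        import numpy as np

        def simulate_var1(A, Sigma, x0, n_steps, rng):
            """Draw a VAR(1) trajectory x_{t+1} = A x_t + eps_t, eps_t ~ N(0, Sigma).

            Each dimension of x is one (normalized) device parameter, so A captures
            temporal memory and Sigma the cross-parameter covariance."""
            L = np.linalg.cholesky(Sigma)
            x = np.empty((n_steps, len(x0)))
            x[0] = x0
            for t in range(1, n_steps):
                x[t] = A @ x[t - 1] + L @ rng.standard_normal(len(x0))
            return x

        rng = np.random.default_rng(0)
        A = np.array([[0.95, 0.02], [0.01, 0.90]])          # near-unit memory per parameter
        Sigma = np.array([[0.010, 0.004], [0.004, 0.020]])  # correlated update noise
        traj = simulate_var1(A, Sigma, x0=np.zeros(2), n_steps=10_000, rng=rng)
        print(np.corrcoef(traj.T))  # reproduces the cross-correlation structure

    Because each step is a matrix-vector product plus Gaussian noise, the process vectorizes trivially across cells, which is what makes billion-cell arrays feasible.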
    Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting. (arXiv:2010.04456v6 [stat.ML] UPDATED)
    Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists in decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. This not only provides the existence and uniqueness for this decomposition, but also ensures interpretability and benefits generalization. Experiments made on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. Code is available at https://github.com/yuan-yin/APHYNITY .  ( 3 min )
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v2 [stat.ML] UPDATED)
    This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which is possibly different for one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.  ( 2 min )
    Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation. (arXiv:2102.07301v2 [cs.LG] UPDATED)
    We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.  ( 2 min )
    Modeling Regime Shifts in Multiple Time Series. (arXiv:2109.09692v3 [cs.LG] UPDATED)
    We investigate the problem of discovering and modeling regime shifts in an ecosystem comprising multiple time series known as co-evolving time series. Regime shifts refer to the changing behaviors exhibited by series at different time intervals. Learning these changing behaviors is a key step toward time series forecasting. While advances have been made, existing methods suffer from one or more of the following shortcomings: (1) failure to take relationships between time series into consideration for discovering regimes in multiple time series; (2) lack of an effective approach that models time-dependent behaviors exhibited by series; (3) difficulties in handling data discontinuities which may be informative. Most of the existing methods are unable to handle all of these three issues in a unified framework. This, therefore, motivates our effort to devise a principled approach for modeling interactions and time-dependency in co-evolving time series. Specifically, we model an ecosystem of multiple time series by summarizing the heavy ensemble of time series into a lighter and more meaningful structure called a \textit{mapping grid}. By using the mapping grid, our model first learns time series behavioral dependencies through a dynamic network representation, then learns the regime transition mechanism via a full time-dependent Cox regression model. The originality of our approach lies in modeling interactions between time series in regime identification and in modeling time-dependent regime transition probabilities, usually assumed to be static in existing work.  ( 2 min )
    On the power of conditional independence testing under model-X. (arXiv:2005.05506v4 [math.ST] UPDATED)
    For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative explanations for empirically observed phenomena and novel insights to guide the design of MX methodology. We show that any valid MX CI test must also be valid conditionally on Y and Z; this conditioning allows us to reformulate the problem as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma then implies that the conditional randomization test (CRT) based on a likelihood statistic is the most powerful MX CI test against a point alternative. We also obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the limiting power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error control under the assumption that only the first two moments of X given Z are known, a significant relaxation of the MX assumption.  ( 2 min )
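    A minimal sketch of the conditional randomization test (CRT) that the optimality result concerns: resample X from its known conditional law given Z, recompute a test statistic, and compare. The toy statistic and conditional model below are our own illustrative choices, not the paper's.

        import numpy as np

        def crt_pvalue(x, y, z, statistic, sample_x_given_z, n_resamples=500, rng=None):
            """CRT: compare T(x, y, z) to its null distribution obtained by
            resampling X from the (known, model-X) law of X given Z."""
            rng = rng if rng is not None else np.random.default_rng(0)
            t_obs = statistic(x, y, z)
            t_null = np.array([statistic(sample_x_given_z(z, rng), y, z)
                               for _ in range(n_resamples)])
            return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)

        # Toy model-X setup: X | Z ~ N(z, 1), and Y depends on Z only (the null holds).
        rng = np.random.default_rng(1)
        z = rng.normal(size=300)
        x = z + rng.normal(size=300)
        y = 2 * z + rng.normal(size=300)
        stat = lambda x, y, z: abs(np.corrcoef(x - z, y - 2 * z)[0, 1])
        print(crt_pvalue(x, y, z, stat, lambda z, rng: z + rng.normal(size=len(z))))

    The paper's result says that, among all valid MX CI tests, the CRT with a likelihood statistic is the most powerful against a point alternative.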
    Gradient flows on graphons: existence, convergence, continuity equations. (arXiv:2111.09459v2 [math.PR] UPDATED)
    Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some mean-field interaction involving a gradient-type potential. However, in many problems, such as in multi-layer neural networks, the so-called particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their size grow to infinity. We show that the Euclidean gradient flow of a suitable function of the edge-weights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Several natural functions on graphons, such as homomorphism functions and the scalar entropy, are covered by our set-up, and the examples have been worked out in detail.  ( 2 min )
    Representing Hierarchical Structure by Using Cone Embedding. (arXiv:2102.08014v2 [cs.AI] UPDATED)
    Graph embedding is becoming an important method with applications in various areas, including social networks and knowledge graph completion. In particular, Poincar\'e embedding has been proposed to capture the hierarchical structure of graphs, and its effectiveness has been reported. However, most of the existing methods have isometric mappings in the embedding space, and the choice of the origin point can be arbitrary. This fact is not desirable when the distance from the origin is used as an indicator of hierarchy, as in the case of Poincar\'e embedding. In this paper, we propose cone embedding, an embedding method in a metric cone, which solves these problems and yields further benefits: 1) we provide an indicator of hierarchical information that is both geometrically and intuitively natural to interpret, and 2) we can extract the hierarchical structure of a graph from the embedding output of other methods by learning additional one-dimensional parameters.  ( 2 min )
    Community Detection with a Subsampled Semidefinite Program. (arXiv:2102.01419v3 [math.OC] UPDATED)
    Semidefinite programming is an important tool to tackle several problems in data science and signal processing, including clustering and community detection. However, semidefinite programs are often slow in practice, so speed up techniques such as sketching are often considered. In the context of community detection in the stochastic block model, Mixon and Xie \cite{mixon2020sketching} have recently proposed a sketching framework in which a semidefinite program is solved only on a subsampled subgraph of the network, giving rise to significant computational savings. In this short paper, we provide a positive answer to a conjecture of Mixon and Xie about the statistical limits of this technique for the stochastic block model with two balanced communities.  ( 2 min )
    A Wasserstein distance approach for concentration of empirical risk estimates. (arXiv:1902.10709v4 [math.ST] UPDATED)
    This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases well known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper.  ( 2 min )
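    The first estimation scheme in the paper — apply the risk measure to the empirical distribution of i.i.d. samples — is easy to make concrete for CVaR, one of the special cases listed above:

        import numpy as np

        def empirical_cvar(samples, alpha=0.95):
            """Estimate CVaR_alpha by applying the risk measure to the empirical
            distribution: VaR plus the mean excess over VaR, scaled by 1/(1-alpha)."""
            var = np.quantile(samples, alpha)
            return var + np.mean(np.maximum(samples - var, 0.0)) / (1.0 - alpha)

        rng = np.random.default_rng(0)
        losses = rng.standard_t(df=3, size=100_000)   # heavy-tailed losses
        print(empirical_cvar(losses, alpha=0.95))

    The paper's concentration bounds quantify how fast such plug-in estimates approach the true risk, including in this heavy-tailed regime (via the truncated-sample scheme).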
    Differentially Private Learning with Adaptive Clipping. (arXiv:1905.03871v5 [cs.LG] UPDATED)
    Existing approaches for training neural networks with user-level differential privacy (e.g., DP Federated Averaging) in federated learning (FL) settings involve bounding the contribution of each user's model update by clipping it to some constant value. However there is no good a priori setting of the clipping norm across tasks and learning settings: the update norm distribution depends on the model architecture and loss, the amount of data on each device, the client learning rate, and possibly various other parameters. We propose a method wherein instead of a fixed clipping norm, one clips to a value at a specified quantile of the update norm distribution, where the value at the quantile is itself estimated online, with differential privacy. The method tracks the quantile closely, uses a negligible amount of privacy budget, is compatible with other federated learning technologies such as compression and secure aggregation, and has a straightforward joint DP analysis with DP-FedAvg. Experiments demonstrate that adaptive clipping to the median update norm works well across a range of realistic federated learning tasks, sometimes outperforming even the best fixed clip chosen in hindsight, and without the need to tune any clipping hyperparameter.  ( 2 min )
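    A sketch of the geometric quantile-tracking update described above, with the differentially private noise on the clipped-fraction signal omitted for brevity (in the actual method that signal is itself privatized):

        import numpy as np

        def adaptive_clip_round(update_norms, C, gamma=0.5, eta=0.2):
            """One round of quantile-based clip tracking.

            b is the fraction of client updates that fit under the current clip C;
            the multiplicative update steers C toward the gamma-quantile of the norms."""
            b = np.mean(update_norms <= C)
            return C * np.exp(-eta * (b - gamma))

        rng = np.random.default_rng(0)
        C = 1.0
        for _ in range(300):                      # client norms concentrated around 4.0
            norms = rng.lognormal(mean=np.log(4.0), sigma=0.3, size=100)
            C = adaptive_clip_round(norms, C)
        print(C)                                  # settles near the median norm (~4.0)

    When fewer than a fraction gamma of updates fit under C, the exponent is positive and C grows; when more fit, C shrinks, so C tracks the target quantile as the norm distribution drifts during training.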
    Random Forests for Change Point Detection. (arXiv:2205.04997v1 [stat.ME])
    We propose a novel multivariate nonparametric multiple change point detection method using classifiers. We construct a classifier log-likelihood ratio that uses class probability predictions to compare different change point configurations. We propose a computationally feasible search method that is particularly well suited for random forests, denoted by changeforest. However, the method can be paired with any classifier that yields class probability predictions, which we illustrate by also using a k-nearest neighbor classifier. We provide theoretical results motivating our choices. In a large simulation study, our proposed changeforest method achieves improved empirical performance compared to existing multivariate nonparametric change point detection methods. An efficient implementation of our method is made available for R, Python, and Rust users in the changeforest software package.  ( 2 min )
    Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression. (arXiv:1804.02605v4 [math.ST] UPDATED)
    Concentration inequalities form an essential toolkit in the study of high dimensional (HD) statistical methods. Most of the relevant statistics literature in this regard is based on sub-Gaussian or sub-exponential tail assumptions. In this paper, we first bring together various probabilistic inequalities for sums of independent random variables under much more general exponential type (namely sub-Weibull) tail assumptions. These results extract a part sub-Gaussian tail behavior in finite samples, matching the asymptotics governed by the central limit theorem, and are compactly represented in terms of a new Orlicz quasi-norm - the Generalized Bernstein-Orlicz norm - that typifies such tail behaviors. We illustrate the usefulness of these inequalities through the analysis of four fundamental problems in HD statistics. In the first two problems, we study the rate of convergence of the sample covariance matrix in terms of the maximum elementwise norm and the maximum k-sub-matrix operator norm which are key quantities of interest in bootstrap, HD covariance matrix estimation and HD inference. The third example concerns the restricted eigenvalue condition, required in HD linear regression, which we verify for all sub-Weibull random vectors through a unified analysis, and also prove a more general result related to restricted strong convexity in the process. In the final example, we consider the Lasso estimator for linear regression and establish its rate of convergence under much weaker than usual tail assumptions (on the errors as well as the covariates), while also allowing for misspecified models and both fixed and random design. To our knowledge, these are the first such results for Lasso obtained in this generality. The common feature in all our results over all the examples is that the convergence rates under most exponential tails match the usual ones under sub-Gaussian assumptions.  ( 3 min )
    Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random. (arXiv:2205.04701v1 [cs.LG])
    In recommender systems, users always choose their favorite items to rate, which results in data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Currently, the doubly robust (DR) method and its variants have been widely studied and demonstrate superior performance. However, we show that DR methods are unstable under extremely small propensities and rely on extrapolation, resulting in sub-optimal performance. In this paper, we propose a stabilized doubly robust (SDR) estimator to address the above limitations while retaining double robustness. Theoretical analysis shows that SDR has a bounded bias, variance and generalization error bound under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for SDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approach significantly outperforms the existing methods.  ( 2 min )
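    For context, here is the vanilla DR estimator that SDR stabilizes (a sketch on synthetic data, not the paper's SDR itself): imputed error everywhere, plus a propensity-weighted correction on the observed entries. Note how propensities as small as 0.02 blow up the correction term, which is exactly the instability the paper addresses.

        import numpy as np

        def dr_estimate(e_hat, e_obs, obs, propensity):
            """Doubly robust estimate of the average prediction error over all
            user-item pairs."""
            correction = obs * (e_obs - e_hat) / propensity
            return np.mean(e_hat + correction)

        rng = np.random.default_rng(0)
        n = 10_000
        e_true = rng.uniform(0, 1, n)                    # true per-pair errors
        propensity = rng.uniform(0.02, 0.5, n)
        obs = rng.random(n) < propensity                 # MNAR-style observation mask
        e_hat = e_true + rng.normal(0, 0.2, n)           # imperfect imputation model
        e_obs = np.where(obs, e_true, 0.0)
        print(dr_estimate(e_hat, e_obs, obs, propensity), e_true.mean())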
    Fixed-point iterations for several dissimilarity measure barycenters in the Gaussian case. (arXiv:2205.04806v1 [stat.CO])
    In target tracking and sensor fusion contexts it is not unusual to deal with a large number of Gaussian densities that encode the available information (multiple hypotheses), as in applications where many sensors, affected by clutter or multimodal noise, take measurements on the same scene. In such cases reduction procedures must be implemented, with the purpose of limiting the computational load. In some situations it is required to fuse all available information into a single hypothesis, and this is usually done by computing the barycenter of the set. However, such a computation strongly depends on the chosen dissimilarity measure, and it must most often be performed using numerical methods, since in very few cases can the barycenter be computed analytically. Some issues, like the constraint on the covariance, which must be symmetric and positive definite, make the numerical computation of the barycenter of a set of Gaussians hard. In this work, Fixed-Point Iterations (FPI) are presented for the computation of barycenters according to several dissimilarity measures, making up a useful toolbox for fusion/reduction of Gaussian sets in applications where specific dissimilarity measures are required.  ( 2 min )
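    For one of the dissimilarity measures covered by such toolboxes — the 2-Wasserstein distance — the barycenter of Gaussians admits a well-known fixed-point iteration on the covariance (the mean is just the weighted average of the means). A sketch:

        import numpy as np
        from scipy.linalg import sqrtm

        def w2_barycenter_cov(covs, weights, n_iter=100, tol=1e-10):
            """Fixed-point iteration for the covariance of the 2-Wasserstein
            barycenter of Gaussians:
            S <- S^{-1/2} ( sum_i w_i (S^{1/2} C_i S^{1/2})^{1/2} )^2 S^{-1/2}."""
            S = np.mean(covs, axis=0)                   # any SPD initializer works
            for _ in range(n_iter):
                root = np.real(sqrtm(S))
                inv_root = np.linalg.inv(root)
                T = sum(w * np.real(sqrtm(root @ C @ root))
                        for w, C in zip(weights, covs))
                S_next = inv_root @ T @ T @ inv_root
                S_next = (S_next + S_next.T) / 2        # re-symmetrize for stability
                if np.linalg.norm(S_next - S) < tol:
                    break
                S = S_next
            return S

        covs = [np.array([[2.0, 0.3], [0.3, 1.0]]),
                np.array([[1.0, -0.2], [-0.2, 3.0]])]
        print(w2_barycenter_cov(covs, weights=[0.5, 0.5]))

    Note how the iteration keeps the iterate symmetric positive definite by construction, sidestepping the covariance-constraint issue mentioned above; other dissimilarity measures lead to different fixed-point maps.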
    A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. (arXiv:2205.05040v1 [cs.LG])
    In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with a nonconvex loss function, a non-Lipschitz continuous gradient, and skipped communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice and thus validates our theory.  ( 2 min )
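    A rough, self-contained illustration of the scheme the abstract describes (each machine takes several clipped local steps, then models are averaged); all names and constants here are illustrative, not the paper's:

    ```python
    import numpy as np

    def clip(g, c):
        n = np.linalg.norm(g)
        return g if n <= c else g * (c / n)

    def local_clipped_sgd(grad_fn, x0, n_machines=4, rounds=100,
                          local_steps=8, lr=0.1, clip_c=1.0, seed=0):
        rng = np.random.default_rng(seed)
        xs = [x0.copy() for _ in range(n_machines)]
        for _ in range(rounds):
            for i in range(n_machines):
                for _ in range(local_steps):        # communication is skipped here
                    g = grad_fn(xs[i], rng)         # stochastic gradient
                    xs[i] = xs[i] - lr * clip(g, clip_c)
            avg = sum(xs) / n_machines              # periodic communication round
            xs = [avg.copy() for _ in range(n_machines)]
        return xs[0]

    # toy quadratic with gradient noise
    grad_fn = lambda x, rng: 2 * x + 0.1 * rng.standard_normal(x.shape)
    print(local_clipped_sgd(grad_fn, np.ones(3)))
    ```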
    Turtle Score -- Similarity Based Developer Analyzer. (arXiv:2205.04876v1 [stat.ML])
    In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the company's culture. This research aims to comprehend, analyze, and automatically produce convincing outcomes for finding a candidate who fits perfectly in the company. Data is examined and collected for each employee working in the IT domain, focusing on their performance measures. This is done across various categories, which brings versatility and a wide view of focus. Learner analysis is then performed on this data using machine learning algorithms to obtain learner similarity and developer similarity, in order to recruit people with identical working patterns. It has been shown that the efficiency and capability of a particular worker increase when working with a person of a similar personality. Therefore, this will serve as a useful tool for recruiters who aim to recruit people with high productivity. That is to say, the designed model will render the best possible outcome, with high accuracy and an immaculate recommendation score.  ( 2 min )
    Don't Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. (arXiv:2205.04790v1 [stat.ML])
    Decision making algorithms, in practice, are often trained on data that exhibits a variety of biases. Decision-makers often aim to take decisions based on some ground-truth target that is assumed or expected to be unbiased, i.e., equally distributed across socially salient groups. In many practical settings, the ground-truth cannot be directly observed, and instead, we have to rely on a biased proxy measure of the ground-truth, i.e., biased labels, in the data. In addition, data is often selectively labeled, i.e., even the biased labels are only observed for a small fraction of the data that received a positive decision. To overcome label and selection biases, recent work proposes to learn stochastic, exploring decision policies via i) online training of new policies at each time-step and ii) enforcing fairness as a constraint on performance. However, the existing approach uses only labeled data, disregarding a large amount of unlabeled data, and thereby suffers from high instability and variance in the learned decision policies at different times. In this paper, we propose a novel method based on a variational autoencoder for practical fair decision-making. Our method learns an unbiased data representation leveraging both labeled and unlabeled data and uses the representations to learn a policy in an online process. Using synthetic data, we empirically validate that our method converges to the optimal (fair) policy according to the ground-truth with low variance. In real-world experiments, we further show that our training approach not only offers a more stable learning process but also yields policies with higher fairness as well as utility than previous approaches.  ( 2 min )
    KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction. (arXiv:2205.04624v1 [cs.CV])
    Predicting future trajectories of road agents is a critical task for autonomous driving. Recent goal-based trajectory prediction methods, such as DenseTNT and PECNet, have shown good performance on prediction tasks on public datasets. However, they usually require complicated goal-selection algorithms and optimization. In this work, we propose KEMP, a hierarchical end-to-end deep learning framework for trajectory prediction. At the core of our framework is keyframe-based trajectory prediction, where keyframes are representative states that trace out the general direction of the trajectory. KEMP first predicts keyframes conditioned on the road context, and then fills in intermediate states conditioned on the keyframes and the road context. Under our general framework, goal-conditioned methods are special cases in which the number of keyframes is equal to one. Unlike goal-conditioned methods, our keyframe predictor is learned automatically and does not require hand-crafted goal-selection algorithms. We evaluate our model on public benchmarks and our model ranked 1st on the Waymo Open Motion Dataset leaderboard (as of September 1, 2021).  ( 2 min )
    A Probabilistic Generative Model of Free Categories. (arXiv:2205.04545v1 [cs.AI])
    Applied category theory has recently developed libraries for computing with morphisms in interesting categories, while machine learning has developed ways of learning programs in interesting languages. Taking the analogy between categories and languages seriously, this paper defines a probabilistic generative model of morphisms in free monoidal categories over domain-specific generating objects and morphisms. The paper shows how acyclic directed wiring diagrams can model specifications for morphisms, which the model can use to generate morphisms. Amortized variational inference in the generative model then enables learning of parameters (by maximum likelihood) and inference of latent variables (by Bayesian inversion). A concrete experiment shows that the free category prior achieves competitive reconstruction performance on the Omniglot dataset.  ( 2 min )
    Matrix and graph representations of vine copula structures. (arXiv:2205.04783v1 [stat.ML])
    Vine copulas can efficiently model a large portion of probability distributions. This paper focuses on a more thorough understanding of their structures. We build on well-known existing constructions to represent vine copulas with graphs as well as matrices. The graph representations include the regular, cherry, and chordal graph sequence structures, which we show to be equivalent. Importantly, we also show that when a perfect elimination ordering of a vine structure is given, it can always be uniquely represented with a matrix. O. M. Nápoles has shown a way to represent vine structures in a matrix, and we turn this previous approach into an algorithm, while also showing a new method for constructing such a matrix through cherry tree sequences. Lastly, we prove that these two matrix-building algorithms are equivalent if the same perfect elimination ordering is used.  ( 2 min )
    Real-time Forecasting of Time Series in Financial Markets Using Sequentially Trained Many-to-one LSTMs. (arXiv:2205.04678v1 [cs.LG])
    Financial markets are highly complex and volatile; thus, learning about such markets for the sake of making predictions is vital for issuing early alerts about crashes and subsequent recoveries. People have been using learning tools from diverse fields such as financial mathematics and machine learning in the attempt to make trustworthy predictions on such markets. However, the accuracy of such techniques was not adequate until artificial neural network (ANN) frameworks were developed. Moreover, making accurate real-time predictions of financial time series is highly sensitive to the ANN architecture in use and the procedure of training it. Long short-term memory (LSTM) is a member of the recurrent neural network family which has been widely utilized for time series predictions. Specifically, we train two LSTMs with a known length, say $T$ time steps, of previous data and predict only one time step ahead. At each iteration, while one LSTM is employed to find the best number of epochs, the second LSTM is trained only for the best number of epochs to make predictions. We treat the current prediction as part of the training set for the next prediction and retrain the same LSTM. While classic ways of training result in more error when the predictions are made further away in the test period, our approach is capable of maintaining superior accuracy as training proceeds through the testing period. The forecasting accuracy of our approach is validated using three time series from each of three diverse financial markets: stock, cryptocurrency, and commodity. The results are compared with those of an extended Kalman filter, an autoregressive model, and an autoregressive integrated moving average model.  ( 2 min )
    Entropic CLT for Order Statistics. (arXiv:2205.04621v1 [cs.IT])
    It is well known that central order statistics exhibit a central limit behavior and converge to a Gaussian distribution as the sample size grows. This paper strengthens this known result by establishing an entropic version of the CLT that ensures a stronger mode of convergence using the relative entropy. In particular, an order $O(1/\sqrt{n})$ rate of convergence is established under mild conditions on the parent distribution of the sample generating the order statistics. To prove this result, ancillary results on order statistics are derived, which might be of independent interest.  ( 2 min )
    A Unified Bayesian Framework for Pricing Catastrophe Bond Derivatives. (arXiv:2205.04520v1 [q-fin.PR])
    Catastrophe (CAT) bond markets are incomplete and hence carry uncertainty in instrument pricing. As such various pricing approaches have been proposed, but none treat the uncertainty in catastrophe occurrences and interest rates in a sufficiently flexible and statistically reliable way within a unifying asset pricing framework. Consequently, little is known empirically about the expected risk-premia of CAT bonds. The primary contribution of this paper is to present a unified Bayesian CAT bond pricing framework based on uncertainty quantification of catastrophes and interest rates. Our framework allows for complex beliefs about catastrophe risks to capture the distinct and common patterns in catastrophe occurrences, and when combined with stochastic interest rates, yields a unified asset pricing approach with informative expected risk premia. Specifically, using a modified collective risk model -- Dirichlet Prior-Hierarchical Bayesian Collective Risk Model (DP-HBCRM) framework -- we model catastrophe risk via a model-based clustering approach. Interest rate risk is modeled as a CIR process under the Bayesian approach. As a consequence of casting CAT pricing models into our framework, we evaluate the price and expected risk premia of various CAT bond contracts corresponding to clusterings of catastrophe risk profiles. Numerical experiments show how these clusters reveal the relationship of CAT bond prices and expected risk premia to claim frequency and loss severity.  ( 2 min )

  • Open

    IsaacGym vs. Brax?
    Is anyone able to compare IsaacGym to Brax? I'm particularly interested in how efficient Brax is in comparison to IsaacGym when running on a powerful desktop PC. Some companies might have the resources to waste energy on inefficient CPU computations; I don't. So I'm not really interested in the distributed computing part of Brax, at least as far as simulation on CPUs is concerned. submitted by /u/felixcra [link] [comments]  ( 1 min )
    Question about the OpenAI 5 Dota 2 Bot
    I have been reading about this bot and it says it learned by playing an insane number of games, equivalent to many years of real time. My question is how it is possible to play this many games? The only way I had previously known of, when it comes to reinforcement learning, is to completely re-code the game so you don't have to deal with the animations and time that come with actually playing the game, and can speed it up. The only other way I could think of is if they got hundreds of thousands of different accounts to play Dota 2, but that's certainly not what they did. Does anyone know how they were able to speed up the playing process to play so many games? submitted by /u/TheGeniusSkipper [link] [comments]  ( 1 min )
    Callbacks in the __init__ method of a gym environment
    I am working with a multiagent gym environment and this is the init method: def __init__(self, world, reset_callback=None, reward_callback=None, observation_callback=None, info_callback=None, done_callback=None, post_step_callback=None, shared_viewer=True, discrete_action=True): Can someone explain what those callbacks are? I would like to send at each timestep the policy LSTM state to the env and I would like to do so by passing it in the step function along with the action. Does that make sense? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
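    Those callbacks are typically plain functions the environment stores and then invokes internally on each step to compute per-agent quantities (this pattern comes from OpenAI's multiagent particle environments). A minimal sketch of the idea, with simplified signatures:

    ```python
    class MultiAgentEnv:
        def __init__(self, world, reward_callback=None,
                     observation_callback=None, done_callback=None):
            self.world = world
            self.reward_callback = reward_callback
            self.observation_callback = observation_callback
            self.done_callback = done_callback

        def step(self, actions):
            # ... apply actions and advance self.world here ...
            agents = self.world.agents
            obs = [self.observation_callback(a, self.world) for a in agents]
            rew = [self.reward_callback(a, self.world) for a in agents]
            done = [self.done_callback(a, self.world) for a in agents]
            return obs, rew, done, {}
    ```

    So the callbacks are for computing rewards/observations/termination, not for passing data in. For sending the policy's LSTM state to the env each timestep, extending step() with an extra argument alongside the action (as you suggest) is a reasonable pattern.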
    Books for engineers with ML experience?
    I've been working in ML and AI for a while, so I am familiar with deep neural networks and implementing them in PyTorch and TensorFlow. But every time I try to work on a reinforcement learning side project, I end up with models that don't seem to learn anything useful. Part of it may just be that the reward space and learning process are different from, say, training a conv network to extract features and label images, which is a much easier problem to define. The state spaces, time component, and lack of clear labels just keep tripping me up. I'd love a book I can work through to help me get through these issues, but I don't really want to start with "what is a neural network" etc. submitted by /u/caedin8 [link] [comments]  ( 2 min )
    How to create a done and reward function for drone model?
    I want to implement the DDPG algorithm on a drone in the pybullet-drones simulator. In this simulator, I have the following observations of a drone: X, Y, Z position in WORLD_FRAME (in meters, 3 values); quaternion orientation in WORLD_FRAME (4 values); roll, pitch, and yaw angles in WORLD_FRAME (in radians, 3 values); the velocity vector in WORLD_FRAME (in m/s, 3 values); angular velocity in WORLD_FRAME (3 values); and motors' speeds (in RPMs, 4 values). I want the drone to follow a circular trajectory, and the episode to be done when the drone hits the ground. So how can I create a reward and done function for this problem? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
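    One possible shaping, as a sketch rather than a canonical answer (the target circle's radius and altitude below are made-up parameters):

    ```python
    import numpy as np

    RADIUS, ALT = 1.0, 1.0  # hypothetical target circle: radius and altitude

    def reward(pos, vel):
        x, y, z = pos
        # distance to the circle of radius RADIUS at altitude ALT
        horiz_err = abs(np.hypot(x, y) - RADIUS)
        alt_err = abs(z - ALT)
        track_err = np.hypot(horiz_err, alt_err)
        return -track_err - 0.01 * np.linalg.norm(vel)  # small velocity penalty

    def done(pos, z_ground=0.02):
        return pos[2] <= z_ground  # episode ends when the drone hits the ground
    ```

    The velocity penalty coefficient is a tuning knob; you may also want a term rewarding tangential motion so the drone actually circulates rather than hovering on the circle.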
    How to utilize existing knowledge while training the agent
    Hi all, I am currently trying to teach my robot manipulator how to reach a goal position while considering the overall energy consumption. Here, I would like to integrate existing knowledge, such as "try to avoid using q1, as it consumes a lot of energy". How could I initialize the training by utilizing this knowledge to boost the training speed? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
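    One common way to inject this kind of prior is reward shaping with per-joint cost weights; a sketch under made-up names and weights:

    ```python
    import numpy as np

    # Hypothetical per-joint energy weights: q1 is known to be expensive.
    JOINT_WEIGHTS = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0])

    def shaped_reward(goal_dist, joint_torques):
        energy_penalty = float(JOINT_WEIGHTS @ np.abs(joint_torques))
        return -goal_dist - 0.01 * energy_penalty
    ```

    This biases the agent away from q1 from the very first episode, instead of waiting for it to discover the energy cost on its own.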
    I am training a chess AI using self-play. I want to train the AI, save the model, and make it play against the Stockfish engine. If I set reset timestep to False, will it still continue training the same model when I call model.learn again?
    submitted by /u/madmax_wush [link] [comments]  ( 1 min )
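    Assuming this is Stable-Baselines3, continuing training on the same model after a save/reload looks roughly like this (a sketch; the environment is a stand-in for your chess env):

    ```python
    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
    model.learn(total_timesteps=10_000)
    model.save("agent")

    # later: reload and keep training the same weights;
    # reset_num_timesteps=False continues the timestep counter and logging
    model = PPO.load("agent", env=model.get_env())
    model.learn(total_timesteps=10_000, reset_num_timesteps=False)
    ```

    Either way, the learned parameters are kept across calls to learn(); reset_num_timesteps only affects the bookkeeping (timestep counter and logs), not the weights.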
    Help needed finding a "complex" RL environment to test an idea on
    Hi All, I am a master's student in AI and I have developed a new training method and have been testing it on a "simple" environment implemented in Python/Gym/Stable-Baselines3/Pygame. My supervisor has said that I need to test the idea on a more "complex" environment, preferably with a high-dimensional state space (perhaps in the hundreds or thousands). Are there any "complex" environments that would be easy to quickly implement in Python/Gym/Pygame? I am running out of time so can't build the environment from scratch. Thanks a lot in advance. submitted by /u/C_BearHill [link] [comments]  ( 1 min )
    Conv3D Gym Env?
    Hi all, I'm working on an RL project in the medical domain, and I want to use 3D grids as the observation for my agent. I have two questions: (1) Does anyone know of any work that's been done using 3D grids and Conv3D, preferably as a Gym env? I've seen a lot of work with 3D environments, but the observation is usually just an image of the 3D env. (2) Would it be better to use mesh or voxels as the input? Thank you! submitted by /u/leozinho2r [link] [comments]  ( 1 min )
    Evaluation seed
    I train with PPO using 64 envs, each of them having a different seed, and then evaluate on 10 envs with different seeds. Can you please tell me if it is recommended to evaluate on a sample of the seeds used in training, or if they must be completely different? submitted by /u/FaithlessnessSuper46 [link] [comments]  ( 1 min )
  • Open

    Window to the Fourth Dimension | @artificial_artists on Instagram
    submitted by /u/atster11 [link] [comments]
    Blood Bridge
    submitted by /u/Hacknaut [link] [comments]
    Hey everyone! I'm not sure if this is the place for this, but I made an Instagram to post all the art I create on Dream by Wombo. Some of the pieces are really cool and I'm gonna keep posting on there so I figured I'd share the page with all of you in case you're interested! @Artificial_artists
    submitted by /u/atster11 [link] [comments]  ( 1 min )
    Data Science Interview Questions
    Preparing for a data scientist interview is tough, since it is unclear which data science questions you will be asked. An interviewer may surprise you with a set of unexpected questions, regardless of how much job experience you have or what data science credentials you possess. Read more submitted by /u/ridamughal110 [link] [comments]
    Use of AI in the Manufacturing Industry
    With continuous advancements in Artificial Intelligence (AI), manufacturers are spearheading efforts to apply it in their manufacturing processes to boost product quality, operational efficiency, workforce safety, and more. Read more submitted by /u/ridamughal110 [link] [comments]
    Aiplague - Lost Underwater Garden (4K 60 FPS) Disco Diffusion
    submitted by /u/nalr00n [link] [comments]
    Doing body measurements with AI
    Hi r/artificial, I am developing an app in which I have to take a person's body measurements. I read somewhere that Artificial Intelligence is being used to measure houses, etc. Is it already possible to use AI to take body measurements of people? For example, their arm or chest size? Also, I would like to know the costs of developing this feature. Can you guys help me pursue my idea? submitted by /u/notmycupofnft [link] [comments]  ( 1 min )
    AI Dream 44 - Epic Trippy Dream (Dragon Wave) 8x
    submitted by /u/LordPewPew777 [link] [comments]
    New AI Robot Chef R2-D-Chew Learns To 'Taste' To Improve Its Cooking
    submitted by /u/getrich_or_diemining [link] [comments]
    Overview of GraphSage for GNNs
    submitted by /u/aidev2040 [link] [comments]
    Question about Roko's Basilisk (I'm not here to create discussion, it's about a uni essay)
    Sorry if you are tired of this stupid basilisk. I don't believe in it, but I'm doing an essay about it and I have a question: why does it only punish those who have heard about it? Why not punish everyone? I know it has to do with Pascal's wager and the idea that "God doesn't punish those who don't know about him", but I still don't really understand it. My google-fu is too weak for this one; I searched and didn't find anything. Thank you for taking the time to read this! submitted by /u/Zholotoi [link] [comments]  ( 1 min )
    Ethical Concerns as a Result of AI!
    Machine learning is progressing fast thanks to neural networks, for the following reasons: the number of data banks has exploded, processing power has received a massive boost, and machine learning algorithms have vastly improved. Read full article: https://us.sganalytics.com/blog/top-ethical-challenges-in-ai-the-price-of-progress/ submitted by /u/JencyJane [link] [comments]
    BlobGAN enables object manipulation in an image
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
    How did you learn Ai?
    submitted by /u/SATelite_Media [link] [comments]  ( 1 min )
    Dynasties and Dystopia (made with starryai)
    submitted by /u/Losthel [link] [comments]  ( 1 min )
    Image Classification With TensorFlow.js
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Deepmind's latest AI has better visual understanding by combining a visual model and a language model
    submitted by /u/Zirius_Sadfaces [link] [comments]
    How do I create GAN imagery like Brock Hampton’s visualizers for ROADRUNNER?
    submitted by /u/AMORALESPLATA [link] [comments]
    MIT Researchers Create ‘ExSum’: A Mathematical Framework To Evaluate Explanations Of Machine Learning Models And Quantify How Well People Understand Them
    Machine learning is frequently referred to as a “black box” because the interactions between input and output become increasingly opaque as the model’s complexity grows. People’s understanding of these models is confined to how data is entered and final choices are made, and there is not much clarity in how these models make their predictions. While previous research has been done to determine how accurate the explanations given by these models are, the question of how quickly and reliably individuals grasp these models remains unexplored territory. Interpretability methods are being developed to better comprehend the functioning of black-box models, which is necessary for their reliable deployment. As a stepping stone in this field, researchers from the Computer Science and Artificial Intelligence Laboratory and Microsoft Research have developed a mathematical framework called explanatory summary (ExSum) for evaluating and quantifying how well individuals understand machine learning models. ExSum exposes different flaws in existing practice and aids in developing accurate model knowledge by identifying the model’s easily missed features. Other properties of explanations, such as human alignment, robustness, and counterfactual minimality, are also included in the framework. The findings will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics. Continue Reading Paper: https://arxiv.org/pdf/2205.00130.pdf Github: https://yilunzhou.github.io/exsum/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    27 audio samples based on Kendrick Lamar's 'Top of the Morning' hook & VQGAN + Clip art
    submitted by /u/gloriousapplecart [link] [comments]
  • Open

    [D] Thoughts on Seldon and their MLOps?
    Context: I am an AI Product Manager at a startup, trying to clean up their ML architecture (moving from batch inference to real-time at the moment). Just wondering if anyone has worked with Seldon before for their MLOps needs? They claimed their turnaround time could be a week (I am heavily skeptical of this), but they do claim to serve some bigger clients, so just wondering about folks' thoughts! submitted by /u/baxter3851 [link] [comments]  ( 1 min )
    [P] Accelerated Inference with Optimum and Transformers Pipelines
    Hey there 👋 It’s Lewis here from the open-source team at Hugging Face 🤗. I'm excited to share the latest release of our Optimum library, which provides a suite of performance optimization tools to make Transformers run fast on accelerated hardware! This release introduces a new set of inference classes, which resemble the autoclasses from Transformers, but run an ONNX model on ONNX Runtime. These classes are API compatible with ordinary PyTorch / TensorFlow models, which means you can run inference using the nifty `pipeline()` function from Transformers! Here's a quick example: from transformers import AutoTokenizer, pipeline from optimum.onnxruntime import ORTModelForQuestionAnswering # Load ONNX checkpoint using the ONNX Runtime inference class model = ORTModelForQuestionAnswering.…  ( 1 min )
    [R] RWKV-v2-RNN : A parallelizable RNN with transformer-level LM performance, and without using attention
    Hi guys. I am an independent researcher and you might know me (BlinkDL) if you are in the EleutherAI discord. I have built an RNN with transformer-level performance, without using attention. Moreover, it supports both sequential & parallel mode in inference and training, so it combines the best of RNN and transformer: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embedding. https://github.com/BlinkDL/RWKV-LM I am training an L24-D1024 RWKV-v2-RNN LM (430M params) on the Pile (trained for 25B tokens as of now) with promising results, and it might reach GPT-Neo performance within 100B tokens. All of the trained models will be open-source. Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, and I believe you can run a 1B-param RWKV-v2-RNN with reasonable speed on your phone. It is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103) with a number of my own tricks, such as: RNNifying it and using my CUDA kernel to speed up training (https://github.com/BlinkDL/RWKV-CUDA); token-shift (https://github.com/BlinkDL/RWKV-LM#token-shift-time-shift-mixing); and SmallInitEmb (https://github.com/BlinkDL/SmallInitEmb), which helps the embedding quality and stabilizes Post-LN (which is what I am using). I also transferred some time-related parameters from a small model to a large model to speed up convergence. Basically the model learns to focus more on short-distance interactions in early layers, and long-distance interactions in later layers. Please feel free to ask questions :) submitted by /u/bo_peng [link] [comments]  ( 3 min )
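    For anyone curious, token-shift (one of the tricks listed above) is tiny: each position mixes its own embedding with the previous position's. A sketch of the idea (not the exact RWKV code):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TokenShift(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.mix = nn.Parameter(torch.full((dim,), 0.5))  # learnable ratio

        def forward(self, x):                 # x: (batch, seq, dim)
            prev = F.pad(x, (0, 0, 1, -1))    # shift the sequence right by one
            return x * self.mix + prev * (1 - self.mix)
    ```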
    [D] Simple question - Does anyone know the size of BUFF dataset?
    Hi everyone, Upon reading [PIFuHD: Multi-Level pixel-aligned implicit function for high-resolution 3d human digitization](https://arxiv.org/abs/2004.00452), I noticed that they used the [BUFF dataset](https://buff.is.tue.mpg.de/index.html), which is open only for academic purposes. To access this data, I have to register an academic e-mail and send some declaration papers... I just want to know how big this dataset is; maybe more than a TB? Does anyone know how big it is? submitted by /u/Frequent-Desk9222 [link] [comments]  ( 1 min )
    [D] How to evaluate complementary datasets for ML models?
    Evaluating ML models is a fundamental task and subfield of machine learning practice. On the other hand, I was not able to find any existing materials, guides, protocols, or papers on how to proceed with evaluating/scoring complementary datasets (resulting in multiple new features, a.k.a. a feature set) when added to an existing model (and retrained with these new features added). This can rather be described as a with/without comparative approach. Let's say that there is some kind of ensemble model (e.g. CatBoost, LightGBM) trained on some X_0 and y data with all the relevant testing metrics for that model type (classification or regression). We receive some new feature sets, X_1, X_2 ... X_n. Our goal is to get an overview of whether [X_0 X_1], [X_0 X_2] .... [X_0 X_n] extended feature mat…  ( 2 min )
    [P] Coco Image Semantic Segmentation Dataset Generator Command Line Tool
    Hi everyone Just wanted to share a really small (micro) Python command line utility that interfaces with the `pycoco` library and allows you to generate TIFF masks from the COCO image dataset for training semantic segmentation models (e.g., UNet), where you can also filter by categories. Found it useful if you want to extract specific images from the COCO dataset for your own semantic segmentation project. TIFF images contain pixel values already representing class labels 0, 1, 2, etc... Hope someone else finds it useful! https://github.com/ralampay/pycocosegmentor submitted by /u/ralampay [link] [comments]  ( 1 min )
    [Research] Industry Analysis: The AI Fairness Toolkits Landscape
    A thoughtful approach to AI ethics is becoming increasingly important for all organizations deriving value from AI. We hope that providing an overview of the top toolkits and resources that exist (starting with Fairness and Robustness) will help more companies adopt AI responsibly, with ethical principles at the core. https://www.borealisai.com/en/blog/industry-analysis-ai-fairness-toolkits-landscape/ submitted by /u/BorealisAI [link] [comments]  ( 1 min )
    [D] Data Scientist, ML Research Scientist, MLE, and Data Engineer
    I am new to non-academic research, but it seems like data scientists sometimes get unnecessarily shat on. From talking to people and reading job descriptions, DS (especially at a startup) often seems closer to a Research Scientist than an MLE, and especially a DE. If what you want to do is research and model development, it seems like DS is the best option next to RS, which seem to be the hardest positions to get. I was initially wary of taking a DS role that might lock me out of being considered for future RS positions, but after getting more exposure I'm doubting that is true. What do you think? I'm curious to hear more opinions because lots of people/companies seem to have different definitions for these roles. submitted by /u/Althonse [link] [comments]  ( 2 min )
    [R] NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
    submitted by /u/tobyoup [link] [comments]  ( 3 min )
    [P] Cluster and analyze text using both embedding and GPT models (interactive visualizations)
    Hi r/MachineLearning, You have likely seen umap visualizations of text embeddings before. I have been interested in overlaying more information on such a figure to help in a text analysis flow. So I'd love to share two interactive exhibits with you: 1) Top 10,000 Hacker News articles of all time, and 2) Top 3,000 posts in Ask HN. I found the following steps produce an interesting result: 1) embed the text (for this example, I'm using the titles of the top-scoring 10K posts on Hacker News); 2) cluster the embeddings (kmeans); 3) UMAP and impose the clusters on the plot; 4) extract keywords from the clusters (using cTFIDF from the awesome BERTopic, which now also supports kmeans!). Cluster naming: the last step is work in progress, which is to have a generative language model assign a better name to a cluster (using the extracted keywords). The post is linked below. In addition to finding super interesting discussions on life insights, software career insights, and content recommendations, it also contains a notebook and the dataset, as well as the embeddings of the top 3K posts in Ask HN. Combing For Insight in 10,000 Hacker News Posts With Text Clustering https://txt.cohere.ai/combing-for-insight-in-10-000-hacker-news-posts-with-text-clustering/ Hope you find it useful! I'm fascinated by flows that use both generative and embeddings models, as these are, for now, two stable NLP methods in a fast-changing field. Feel free to share any experiences or projects you've done along those lines! submitted by /u/jayalammar [link] [comments]  ( 2 min )
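    The core of that flow fits in a few lines; a sketch assuming you already have an (n_docs, dim) array of text embeddings (hyperparameters here are arbitrary):

    ```python
    import numpy as np
    import umap                          # pip install umap-learn
    from sklearn.cluster import KMeans

    emb = np.load("embeddings.npy")      # hypothetical file of text embeddings

    labels = KMeans(n_clusters=8, random_state=0).fit_predict(emb)
    coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(emb)

    # Plot coords[:, 0] vs coords[:, 1] colored by `labels` to get the map;
    # per-cluster keyword extraction (e.g. cTFIDF) then names the clusters.
    ```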
    [P] New lightweight library for head-gesture detection
    The Nodding Pigeon library provides a pre-trained model and a simple inference API for detecting head gestures in short videos. Under the hood, it uses Google MediaPipe for collecting the landmark features. For ML practitioners, this project is also an example of using generative data from a small base-dataset for model training. Please take a look! :) https://github.com/bhky/nodding-pigeon submitted by /u/xtorch501 [link] [comments]  ( 1 min )
    Reconstructing data [Discussion]
    Hey guys, as a physicist, I learned supervised ML on the fly. I optimize multi-variable functions to reproduce quantum mechanics simulations. I have a quite different problem here: I would like to reconstruct data. I have thousands of lines of data, and each line has several attributes. One of those attributes is missing for some data. It can only take a few possible values. I need to reconstruct it based on the other attributes. I started to dig a bit into the possible methods I could use. Could you give me a few pieces of advice or opinions to get me started? I would greatly appreciate any input. PS: I code in Python. submitted by /u/ant_two [link] [comments]  ( 3 min )
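    Since the missing attribute takes only a few possible values, a standard starting point is to treat this as supervised classification: train on the rows where the attribute is present and predict it where it is missing. A sketch with scikit-learn (file and column names are made up, and the other attributes are assumed to be complete numeric columns):

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("data.csv")                    # hypothetical data file
    target = "attr"                                 # the incomplete attribute
    features = [c for c in df.columns if c != target]

    known = df[df[target].notna()]
    missing = df[df[target].isna()]

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(known[features], known[target])

    df.loc[missing.index, target] = clf.predict(missing[features])
    ```

    Cross-validating on the known rows first tells you how trustworthy the reconstruction is before you fill anything in.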
    [R] DALLE-2 paper explained - model architecture, results and image manipulation
    Here is a video explaining the model architecture of the DALLE-2 architecture: https://youtu.be/Z8E3LxqE49M The paper title is, "Hierarchical Text-Conditional Image Generation with CLIP Latents" and the arxiv link to the paper is here: https://arxiv.org/abs/2204.06125 Official website is here: https://openai.com/dall-e-2 Hope its useful. submitted by /u/Combination-Fun [link] [comments]
    [P] Open-source to speed up deep learning inference by leveraging multiple optimization techniques (deep learning compilers, quantization, half precision, etc)
    Hi everyone, my name is Emile. After spending a long time trying many tools to speed up AI inference, with some colleagues I have built an open-source library that bundles the best techniques around and consolidates them into a single interface (today's release features the integration of most of these techniques). Let me know what you think of this OSS! Thank you. The library is called nebullvm and takes your AI model as input and outputs an optimized version that runs 2-30 times faster on your hardware. Nebullvm tests multiple optimization techniques (deep learning compilers, quantization, sparsity, distillation, and more) to identify the optimal way to execute your AI model on your specific hardware. The library can speed up your model 2 to 10 times without loss of performance (thanks to deep learning compilers, e.g. TensorRT, OpenVINO, MLIR, etc.), or up to 20-30 times if you specify that you are willing to trade off a self-defined amount of accuracy/precision to achieve even lower latency and a lighter model (which might be useful for edge devices). This further acceleration is achieved by leveraging techniques that slightly modify the graph of the model to make it lighter, such as quantization, half-precision, distillation, sparsity, etc. (more information about these techniques is on the nebullvm GitHub: https://github.com/nebuly-ai/nebullvm). The goal of nebullvm is to help other developers benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing, and debugging these powerful technologies. I hope you enjoy the project, and let me know if you have any comments or advice, or the acceleration you get on your model/hardware. submitted by /u/emilec___ [link] [comments]  ( 2 min )
    [D] CIKM 2022 - Call for Demonstrations format
    As part of my PhD I have developed a web application in which one can apply different explainability methods to deep learning models and visualize the outcomes. The web application also addresses some specific use-cases which I (currently) think have not been addressed thus far. The application is in an early prototype stage, which is why I was considering submitting a demonstration paper to the CIKM demonstrations track, since it might fit pretty well, considering the format of papers that have been accepted as demos in previous years. However, I am not familiar with what the presentation format looks like in case a demo paper gets accepted. Is it similar to a poster session where, in the case of a demo, one would stand beside their laptop and demonstrate the application to other attendees as they pass by, or does everyone with an accepted demo paper get a small time slot where they have 15-20 minutes to present the application and then receive questions? I would appreciate it if anyone who attended previous CIKM demo tracks could explain the format a bit. submitted by /u/purposeless_username [link] [comments]  ( 1 min )
    [D] Is it true that every algorithm to detect deepfakes can be used to generate better deepfakes?
    This position seems somewhat rational, but I seem to remember I read somewhere, a long time ago, that this was indeed not the case, though I can't remember what the argument was based upon. Is the game of detecting/creating deepfakes really a cat and mouse game? submitted by /u/kugkfokj [link] [comments]  ( 3 min )
    [P] Deep RL Zoo - A collection of Deep RL algorithms implemented with PyTorch
    Hi guys, I recently created a new repo on GitHub; it contains a list of RL agents for solving discrete action space problems like classic control and Atari games. It includes the most recent algorithms from DeepMind like Never Give Up and Agent57 (not fully tested on Atari games yet because of a lack of hardware resources). Hope you will find it helpful. https://github.com/michaelnny/deep_rl_zoo The post was originally posted on r/reinforcementlearning a few days ago, but I thought re-posting here might reach more people. Have a good day! submitted by /u/Top_Serve_2348 [link] [comments]  ( 1 min )
  • Open

    Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy
    After talking to many customers recently at Dell Technologies World, I am very, very (very!) concerned about how many organizations are putting their Data Strategy success into the hands of Data Meshes. Sorry, but I think the way that IT organizations are thinking about a Data Mesh is fool’s gold. I think the data mesh (along… Read More »Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy The post Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy appeared first on Data Science Central.  ( 6 min )
    Top Technological HR Skills for the HR Professional
    The Human Resource Management department has had a very challenging time due to the pandemic. Globally, the coronavirus pandemic has created a lot of havoc, and people became more concerned about health and safety. This pandemic has given rise to new realities like social distancing, online (virtual) work, self-isolation, lockdowns, shutdowns, quarantine, shelter-in-place, and essential businesses. These concepts gave… Read More »Top Technological HR Skills for the HR Professional The post Top Technological HR Skills for the HR Professional appeared first on Data Science Central.  ( 4 min )
  • Open

    Create video subtitles with Amazon Transcribe using this no-code workflow
    Subtitle creation on video content poses challenges no matter how big or small the organization. To address those challenges, Amazon Transcribe has a helpful feature that enables subtitle creation directly within the service. There is no machine learning (ML) or code writing required to get started. This post walks you through setting up a no-code […]  ( 9 min )
  • Open

    Here is how an R-Learning agent beats humans
    Human intuition is usually good at dealing with concepts like averages or mean-values, whereas it often performs poorly when it comes to…  ( 9 min )
  • Open

    Creator Karen X. Cheng Brings Keen AI for Design ‘In the NVIDIA Studio’
    The future of content creation is in AI. This week In the NVIDIA Studio, discover how AI-assisted painting is bringing a new level of inspiration to the next generation of artists. The post Creator Karen X. Cheng Brings Keen AI for Design ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Approximating a golden spiral with circular arcs
    The previous post included this image of a logarithm spiral passing through the corners of squares in a sequence of golden rectangles. The portion of the spiral in each square looks like a quarter of a circle. How well would circular arcs approximate the spiral? Very well. Here’s a plot. The circular arc inside the […] Approximating a golden spiral with circular arcs first appeared on John D. Cook.  ( 1 min )

  • Open

    [P] It's settled: AutoArima is a lot(!) faster and more accurate than FB-Prophet. Now you can replace it with just two lines of code without making changes to your pipeline
    We benchmarked on more than 100K series and show that you can improve MAPE forecast accuracy by 17% with 37x less computational time using Nixtla's StatsForecast. That's the difference between paying $10 or $296 on AWS. It’s time to overcome the false prophets. Check Nixtla's FB-Prophet adapter: https://github.com/Nixtla/statsforecast/tree/main/experiments/arima_prophet_adapter (The linked image shows the two lines you need.) submitted by /u/fedegarzar [link] [comments]  ( 1 min )
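    For reference, the swap presumably looks something like this with statsforecast (a sketch, assuming the current statsforecast interface and its usual long-format dataframe with unique_id, ds, and y columns):

    ```python
    import pandas as pd
    from statsforecast import StatsForecast
    from statsforecast.models import AutoARIMA

    # toy long-format data: one row per (series, timestamp)
    df = pd.DataFrame({"unique_id": "series_1",
                       "ds": pd.date_range("2020-01-31", periods=36, freq="M"),
                       "y": range(36)})

    sf = StatsForecast(models=[AutoARIMA(season_length=12)], freq="M")
    forecast = sf.forecast(df=df, h=12)   # 12-step-ahead forecasts
    ```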
    [D] What are some good platforms to host new datasets published in ML conferences?
    I recently published an NLP paper which included a new dataset. What are some good platforms to host new datasets (ours is around 5GB)? Preferably platforms which allow adding citation information and have low costs! I am currently checking out Hugging Face, but it seems like there are some size and dataset-type limitations. submitted by /u/shivamag99 [link] [comments]  ( 1 min )
    [D] Looking for the name of a particular regression technique
    This is a method I've read somewhere, but I can't remember the reference or its name. You have, say, n = 500 points in two dimensions. You draw a line for each pair of points; that is, you have n(n-1)/2 lines. If n is too large, you can sample a few thousand lines. The envelope of these lines is your prediction band. The confidence level depends on the number of lines in your plot. It is entirely model-free and data-driven. It also works if instead of a line you use a 2-parameter curve (say you are dealing with logistic regression rather than linear regression). Note that one of the lines will be the best fit according to L1 (that is, the one minimizing the absolute residual error), while the traditional regression line is a best L2 fit (minimizing the square of the residual error). The L1 version is more robust against outliers. This generalizes to d dimensions, with lines replaced by hyperplanes or hyper-surfaces. In this case, you need to look at all combinations of d points taken together to determine all the hyperplanes. In practice, you only use a sample. How is the method referred to in the literature? What would be a good reference on the topic? submitted by /u/MLRecipes [link] [comments]  ( 1 min )
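    The construction is easy to reproduce numerically, which may help in tracking down the name (it is related in spirit to pairwise-slope estimators like Theil-Sen, though the envelope/band part is the distinctive feature). A sketch with n kept small enough to enumerate all pairs:

    ```python
    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 2 * x + 1 + rng.normal(0, 2, 50)

    grid = np.linspace(0, 10, 200)
    lines = []
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                       # skip degenerate vertical lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        lines.append(slope * (grid - x[i]) + y[i])
    lines = np.array(lines)

    # pointwise quantiles of the line bundle give a (trimmed) envelope band
    lo, hi = np.quantile(lines, [0.05, 0.95], axis=0)
    ```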
    [D] Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks
    Hi there, The Gradient has a new article that many of you might find interesting - Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks . The message-passing paradigm has been the “battle horse” of deep learning on graphs for several years, making graph neural networks a big success in a wide range of applications, from particle physics to protein design. From a theoretical viewpoint, it established the link to the Weisfeiler-Lehman hierarchy, allowing to analyse the expressive power of GNNs. We argue that the “node and edge-centric” mindset of current graph deep learning schemes imposes strong limitations that hinder future progress in the field. As an alternative, we propose physics-inspired “continuous” learning models that open up a new trove of tools from the fields of differential geometry, algebraic topology, and differential equations so far largely unexplored in graph ML. submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    👋 Hey there! Britney Muller here from Hugging Face. We've got some big news to share! Hugging Face Full Series C Announcement: https://huggingface.co/blog/series-c TechCrunch: https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/ We want to have a positive impact on the AI field. We think the direction of more responsible AI is through openly sharing models, datasets, training procedures, evaluation metrics and working together to solve issues. We believe open source and open science bring trust, robustness, reproducibility, and continuous innovation. With this in mind, we are leading BigScience, a collaborative workshop around the study and creation of very large language models gathering more than 1,000 researchers of all backgrounds and disciplines. We are now training the world's largest open source multilingual language model 🌸 Over 10,000 companies are now using Hugging Face to build technology with machine learning. Their Machine Learning scientists, Data scientists and Machine Learning engineers have saved countless hours while accelerating their machine learning roadmaps with the help of our products and services. ⚠️ But there’s still a huge amount of work left to do. At Hugging Face, we know that Machine Learning has some important limitations and challenges that need to be tackled now like biases, privacy, and energy consumption. With openness, transparency & collaboration, we can foster responsible & inclusive progress, understanding & accountability to mitigate these challenges. Thanks to the new funding, we’ll be doubling down on research, open-source, products and responsible democratization of AI. submitted by /u/Britney-Ramona [link] [comments]  ( 3 min )
    [D] Are there any software engineers that switched into a machine learning role and found it a lot more stressful due to deadlines combined with the uncertainty of research?
    [D] Are there any software engineers that switched into a machine learning role and found it a lot more stressful due to deadlines combined with the uncertainty of research? I found that switching to an ML role, where I am both managing and improving data pipelines while experimenting with different ML models and doing feature engineering, is a lot more stressful than a standard SWE job. As a SWE I had tasks, and I knew I was capable of doing all of them given some amount of time, and I just peacefully worked my way through them and had a lot of down time. With ML, I am continually trying new things and having to wait for the model to train and be evaluated, only to come back with incremental to zero improvements, while still having to meet monthly or quarterly milestones. I don’t have that guarantee that I know I can complete a task given some amount of time. There’s so much uncertainty in the ML research process, and the stress is making me want to switch back to SWE. The data pipelining and SWE parts are easy; it’s the contemplation of why my model isn’t working, and what the next appropriate step is to improve it, that is difficult for me. 90% of the time the things I try don’t lead to any improvement. To clear things up, I don’t have a PhD in statistics or CS, only a BS/MS in CS, so while I studied ML from the CS side, my statistical understanding of ML isn’t the best, which I believe may be contributing to this. Anyone else deal with this? Is this a common feeling in this field? submitted by /u/Legitimate_Bison3756 [link] [comments]  ( 10 min )
    [D] Stats person in a team with all computer science colleagues....
    This is something I am really curious to know... Are there folks out there with a 'predominantly stats' background who are working in a machine learning / deep learning team with all colleagues from computer science degrees? If so, could you please respond with your experience on: - What are the major challenges you are facing, if any? - Do you find the thought processes and approach of your colleagues to data, models, etc. very different from yours? Does it cause any issues in your work? - Do you ever feel overwhelmed with software development practices, installation, code check-in issues, etc.? For slightly better context: by 'predominantly stats' background, I mean folks with primary degrees in stats, with basic training in coding in Python, R, etc., but little or no training in stuff like ML engineering, building pipelines, code check-in/check-out, or developing code for very complex deep learning models with many classes, modules, libraries, etc. I'd really appreciate an in-depth answer. submitted by /u/sol2296 [link] [comments]  ( 2 min )
    Pros/cons of using model on different data without retraining? [D]
    Hi all, I'm looking into a classification problem with each observation being an aggregated geographical area, on which I've trained some classifying algorithms. The response variable is a binary indicator of poverty. My question is: what are the benefits or drawbacks of using my best model, without re-training it, on data disaggregated to a lower level or data from different time periods? Thanks for your thoughts! submitted by /u/Pickles654 [link] [comments]  ( 1 min )
    [D] What are some ways to improve your *science* for a top-tier ML conference?
    There seemed to be some consensus in the responses to yesterday’s post asking about ways to improve your paper for a top-tier ML conference—the primary thing you should do is ensure your research makes a top-tier worthy contribution. That itself seemed like it might be worthy of further exploration, so I thought I would continue the discussion: What traits characterize a high value research idea or insight? What are good strategies for pursuing one? submitted by /u/EmmyNoetherRing [link] [comments]  ( 2 min )
    [D] [R] Does anyone know how to implement the MostPop algorithm in recommendation systems?
    The MostPop algorithm is basically an algorithm that calculates metrics based on the most popular items at the time. This algorithm keeps popping up in papers but there is no description of it being used to calculate RMSE or such errors. Using the knowledge that I currently have I can calculate metrics such as Hit Rate (how many users have interacted with the item i) but is there any way that I can use a modification of this algorithm or this algorithm itself to predict what rating would be given to item "i" by user "x"? submitted by /u/GustaMusto [link] [comments]  ( 1 min )
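    MostPop is a non-personalized baseline, so it cannot produce genuinely per-user ratings; the closest analogue for RMSE-style evaluation is to predict each item's mean rating for every user (a sketch with pandas; the file and column names are assumptions):

    ```python
    import pandas as pd

    ratings = pd.read_csv("ratings.csv")    # columns: user, item, rating

    # popularity = interaction count, for hit-rate style metrics
    popularity = ratings["item"].value_counts()
    top_k = popularity.head(10).index.tolist()

    # rating analogue: every user gets the item's mean rating
    item_mean = ratings.groupby("item")["rating"].mean()
    global_mean = ratings["rating"].mean()

    def predict(user, item):
        return item_mean.get(item, global_mean)   # user is ignored by design
    ```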
    [D] IJCAI 2022 camera ready review notification.
    Has anyone got the email from the editors after submitting the camera ready? submitted by /u/random_effective [link] [comments]
    [D] Is there any AI solution to "fill in" images between images?
    For example, I provide a 2D image of a stickman on the left of the frame and a second image of the stickman on the right of the frame (as if he was moving from left to right). Would there be an AI solution that would output an image of this stickman in the middle of the frame? Like an AI solution that guesses that this is a stickman moving from left to right and creates the image in between the two images I provided. I'm asking this while thinking about creating an animation using AI where I just provide the "main" images/drawings and it "fills out" the rest. I googled a lot and couldn't find anything suitable for this. I'd love to hear your recommended libs/GitHub profiles/notebooks that could help me achieve this. I'm a musician and data science practitioner and I'd love to unite these two worlds in a video clip. submitted by /u/refrigerador82 [link] [comments]  ( 2 min )
    [D] Looking for ideas on NLP project
    Hello, I am trying to use text mining/text processing/NLP to apply certain changes to a big text file. More precisely, I have a big text file in html format (specifically a law). This law is then amended or changed by another law (a much smaller html file). For example, a law can be something like: "The ten commandments: 1. Thou shalt have no other gods before me. 2. Thou shalt not make unto thee any graven image. 3. Thou shalt not take the name of the Lord thy God in vain. etc..." Then a change would be something like: "Changes to the Ten commandments: In The ten commandments, item 1 is changed to 'Thou shalt not murder'. Item 2 is removed. In item 3, after the word 'vain', a comma is added followed by the sentence 'unless the situation calls for it'." The output needs to be: "The ten commandments: 1. Thou shalt not murder. 2. Thou shalt not take the name of the Lord thy God in vain, unless the situation calls for it." It's a really dumb example but that's the gist of it. The laws are on average 10 to 40 pages long while the changes are usually a couple pages long. Basically, the input would be one giant text file and either one or, if possible, multiple text files that contain changes to be made to the original file. The output would be an amended law with all the changes applied. The language of the text is not English or any other popular language. The changes are not written in a standardized way, and there are also declension and typos to consider, which is why I would prefer using an ML approach rather than analysing the text by brute force. How would you go about approaching a problem like this? I have some experience with neural nets and machine learning, but very little in NLP, so I am pretty lost. submitted by /u/WaferPlenty677 [link] [comments]  ( 1 min )
    [R] Happy to share my paper and Python code on efficient implementation of incremental proximal-point method for training machine-learning models.
    Paper: https://arxiv.org/abs/2205.01457 Code: https://github.com/alexshtf/inc_prox_pt Models are often trained using variants of the gradient update rule: xₜ₊₁ = xₜ-β∇ƒ(xₜ) where x is the model parameters vector, and ƒ is the cost function associated with the current (mini-batch of) training sample. This rule has another well-known interpretation - the proximal view: xₜ₊₁ = argmin { ƒ(xₜ) + ⟨∇ƒ(xₜ), x-xₜ⟩ + β/2 ‖x-xₜ‖² }, meaning "balance between minimizing a linear approx. of ƒ at xₜ and being close to xₜ". The step-size β determines the balance between the two opposing forces. So what happens if we replace the linear approximation with the cost function itself? We obtain xₜ₊₁ = argmin { ƒ(x) + β/2 ‖x-xₜ‖² }. This is called the "proximal operator of ƒ", and algorithms of this sort are called "proximal-point methods". So why not always exploit the cost itself, instead of just its slope? What's the catch? Depending on the complexity of ƒ, the above can be extremely challenging to compute. So why should we bother? Various works, for example https://arxiv.org/pdf/1903.08619.pdf by Asi and Duchi, show that such methods provide better resilience to step-size choice, which may make hyper-parameter tuning much cheaper - just guess a few step-sizes and you will get decent performance. Various extensions and works in online optimization also show nice stability properties, such as resilliance to temporal variability of the adversarial cost function. In this paper I developed efficient algorithms for computing the above step for a variety of functions ƒ useful in machine learning, and published a Python library based on PyTorch, which practitioners can use for some problem classes described in the paper, and researchers developing new proximal-point methods can use it to numerically demonstrate their brand-new "super proximal-point method with some cool momentum" on interesting problems. submitted by /u/alexsht1 [link] [comments]  ( 3 min )
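    As a concrete instance of why the prox can be cheap: for a single least-squares term ƒ(x) = ½(aᵀx − b)², the proximal step has the closed form x⁺ = xₜ + a(b − aᵀxₜ)/(β + ‖a‖²). A sketch in NumPy, following the β-convention above (this is a generic example, not code from the linked repository):

    ```python
    import numpy as np

    def prox_step_least_squares(x, a, b, beta):
        """argmin_z 0.5 * (a @ z - b)**2 + (beta / 2) * ||z - x||^2."""
        return x + a * (b - a @ x) / (beta + a @ a)

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=5), 1.0
    x = np.zeros(5)
    for _ in range(100):          # incremental proximal-point on one term
        x = prox_step_least_squares(x, a, b, beta=1.0)
    print(a @ x)                  # approaches b
    ```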
    [D] What is the largest / most diverse GAN model currently out there?
    Hi community, I'm currently building a fork of StyleCLIP global directions which allows you to control multiple semantic parameters simultaneously to generate and edit an image with StyleGAN and CLIP in real time. I want to showcase its potential as a design tool. Unfortunately, GAN weights are trained on very domain-specific data (faces, cars, churches). This makes them inferior to modern diffusion models, which I can use to generate whatever comes to mind. Although I know we won't have a GAN-based DALL-E counterpart anytime soon, I would still love to use my system with weights that can output a wide variety of things. So what are the biggest pretrained GAN models out there? Has anyone tried to train one on LAION-400M or similar? submitted by /u/_chinatown  ( 1 min )
    [D] What are some ways to improve your paper for a top-tier ML conference?
    ^^^ I'm sure this is a really broad question with no one-size-fits-all answer. I am an undergrad, very new to this research field, and I still don't have a good sense of what constitutes a well-written, top-conference-worthy paper, so I would welcome any thoughts. From what I've seen, some of the best-written papers have the following: - A very clear statement of the impact of their contribution, usually near the beginning. This tells the reader why they should care about this work (and tells the reviewer why they should advocate for acceptance :D). - Usually a compelling teaser image in the second column of the first page (this is also good for Twitter publicity in the future). - Not sure if this is specific to my subfield (NLP), but the main results are enumerated and bolded at the …  ( 5 min )
    Calling all Developers [P]
    Wanna help people get away from being frustrated by those pesky menu bots that run you through different options. Looking for an Alexa-like AI voice assistant that can solve problems it has learned before and send the problems it doesn't know to customer service. If interested, comment below, or if you think it's dumb, also let me know; all views welcome here. submitted by /u/Rhiquire
    Classification & Generation of Microscopy Images for Malaria via Artificial Neural Networks in Ghana
    submitted by /u/pasticciociccio
    A.I. Farms?
    While looking through some DALL-E 2 pictures, I thought to myself: are there companies building AI farms, in a way? Basically taking requests from other companies to train an AI to do a certain thing. That could be a money maker, but as always, someone has probably already thought of that? submitted by /u/Swaggyswaggerson
    Python AI for files
    So I got this exercise, where I get around 20 files with 5000 values in 6 columns (acceleration X, Y, Z and gyro X, Y, Z). The exercise is to code an AI for a smartwatch that can detect whether the user fell down or is, for example, just shaking their hand. Each of the 20 files represents either a fall or no fall. I have to code and train this AI in Python, but I have no idea how. I don't have the 20 files yet, but I have to code a template where I can just insert the files, so that the moment I get them, I can train the AI. At first I misunderstood the exercise and thought I would get only 6 values per row, and coded the AI as you see it in the picture. But then I realized that if you fall, you don't have one specific acceleration and gyro value, but rather an interval of values. Then I remembered that each of the files is a 5-second recording, where a value was recorded and written to the file every 10 milliseconds. But I really don't have any idea how to compare files with 5000 values in 6 columns, or rather how to write an AI which, given such a file, can predict whether it's a fall or not. submitted by /u/Specific-Feeling-666  ( 2 min )
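    A template along the lines the post asks for might look like this (a sketch, assuming PyTorch; the layer sizes and the dummy data are placeholders, and file parsing is omitted): a small 1D CNN that takes a whole recording of shape (6 channels, 5000 timesteps) and outputs a single fall/no-fall logit.

        import torch
        import torch.nn as nn

        class FallNet(nn.Module):
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv1d(6, 32, kernel_size=7, stride=2), nn.ReLU(),
                    nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size vector
                )
                self.head = nn.Linear(64, 1)  # one logit: fall vs. no fall

            def forward(self, x):  # x: (batch, 6, 5000)
                return self.head(self.features(x).squeeze(-1)).squeeze(-1)

        model = FallNet()
        x = torch.randn(4, 6, 5000)  # 4 dummy recordings until the real files arrive
        loss = nn.BCEWithLogitsLoss()(model(x), torch.ones(4))
        loss.backward()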
    Here's a repository where I try to keep up with the most interesting research papers of 2022. It is a curated list of the latest breakthroughs in AI and Data Science by release date with a clear video explanation, link to a more in-depth article, and code (if applicable).
    submitted by /u/OnlyProggingForFun  ( 1 min )
    Last Week in AI: New methods to detect deepfakes, AI for smarter farming, AI to detect heart problems from Apple Watch, and more!
    submitted by /u/regalalgorithm
    This deep learning technique solves one of the tough challenges of robotics
    submitted by /u/bendee983
    Researchers at Imperial College London have shown it is possible to perform artificial intelligence using tiny nanomagnets that interact like neurons in the brain.
    submitted by /u/DutchTechJunkie  ( 1 min )
    Introduction to Tensorflow.js with Real-World Example
    submitted by /u/RubiksCodeNMZ
    Google AI Introduces A Method For Automating Inter- And Intra-Operator Parallelism For Distributed Deep Learning
    The rapidly rising size of deep learning models has in recent years swiftly overtaken the memory capacity of single accelerators. Earlier models, such as BERT (with a parameter size of 1GB), may scale across accelerators quickly by utilizing data parallelism, which duplicates model weights across accelerators while splitting and distributing training data. Recent huge models, such as GPT-3 (with a parameter size of 175GB), can only be scaled through model-parallel training, which involves partitioning a single model across many machines. While model parallelism solutions allow for the training of very large models, they are more challenging to implement since they must be tailored to the target neural networks and computing clusters. Megatron-LM, for example, splits weight matrices by rows or columns and then synchronizes the results across devices using a model parallelism technique. In pipeline (inter-operator) parallelism, different operators in a neural network are divided into several groups, and the input data is divided into micro-batches that are run in a pipelined manner. Continue Reading Paper: https://arxiv.org/pdf/2201.12023.pdf Code: https://github.com/alpa-projects/alpa https://i.redd.it/6mu4lyihbey81.gif submitted by /u/No_Coffee_4638  ( 1 min )
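    A toy illustration of intra-operator parallelism (my sketch, not Alpa's API): split a weight matrix by columns across "devices", let each compute its slice of the output independently, then concatenate the results.

        import numpy as np

        def column_parallel_matmul(x, w, n_devices):
            shards = np.array_split(w, n_devices, axis=1)  # one column shard per device
            partials = [x @ shard for shard in shards]     # computed independently
            return np.concatenate(partials, axis=1)        # gather the output slices

        x = np.random.randn(8, 512)     # a micro-batch of activations
        w = np.random.randn(512, 2048)  # a weight matrix too large for one device
        assert np.allclose(column_parallel_matmul(x, w, 4), x @ w)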
    Does anybody have at least some idea what this AI is called? 🐶😳😏✨
    submitted by /u/p0goniphaft111  ( 1 min )
    PPO exploration on hidden layers
    Hi everyone, I have a question about PPO, but it can be extended to any stochastic policy. Imagine I have a DNN in which the first part is an MLP with 2 outputs, plus 1 additional layer (whatever it is). Usually, PPO does exploration on the DNN's final output by sampling the stochastic policy. The problem is that for my particular application I need the final layer to be deterministic, without exploration; still, I would like to explore on the 2 MLP output neurons. Does this mess up the PPO loss (log probabilities and entropy)? submitted by /u/Tricky_Ad_853  ( 1 min )
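    One way to set the question's architecture up (a sketch, assuming PyTorch and a diagonal Gaussian on the 2-d intermediate output): PPO's log-probabilities and entropy are then computed with respect to the sampled latent, while the final layer stays deterministic.

        import torch
        import torch.nn as nn

        class LatentGaussianPolicy(nn.Module):
            def __init__(self, obs_dim, act_dim):
                super().__init__()
                self.mlp = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 2))
                self.log_std = nn.Parameter(torch.zeros(2))
                self.head = nn.Linear(2, act_dim)  # deterministic final layer

            def forward(self, obs):
                dist = torch.distributions.Normal(self.mlp(obs), self.log_std.exp())
                z = dist.rsample()     # exploration happens on the 2 MLP outputs only
                action = self.head(z)  # no sampling here
                return action, dist.log_prob(z).sum(-1), dist.entropy().sum(-1)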
    What's the best RL lib?
    What's the best RL lib, or in general, what's the best approach to writing distributed RL like R2D2, etc.? 1) Stable Baselines 2) Ray RLlib 3) Other 4) Write it from scratch? What do you guys use to implement distributed RL? submitted by /u/Defiant_Sun5579  ( 1 min )
    When you have a recurrent policy, should you reset the hidden states at the end of each episode or after each step?
    submitted by /u/No_Possibility_7588  ( 1 min )
    Using CNNs in Reinforcement Learning
    For the last 18 months, I have been dabbling in DRL for both my MSc thesis and now also my PhD. I have, however, only worked with fixed-size, 1-dimensional state arrays for all of my problems, and have used both baseline algorithms and adapted one version of SAC to better suit my problem needs. Note that even though I really like the main working principles of DRL and would love to dive deeper into the fundamentals, I do not have the required schooling in either mathematics or computer science to really afford going in that direction. Currently I am working on a problem for which I think either CNNs or GNNs might be very useful, but most of the implementations I can find use the same network structure for the CNNs of the Actor and Critic (& Value) networks, yet the layers don't seem to be shared. Does this mean that all of these networks train different CNNs for similar purposes? Wouldn't it make sense for the CNN layers to be shared across all of the networks, since in principle they serve the same purpose: to transform the multi-dimensional input into a 1D representation that contains sufficient information for the FC layers? It might be that I understand this completely incorrectly and this is already the case; if it's not, could someone explain why you wouldn't want to share these layers between the different networks? submitted by /u/jangroterder  ( 3 min )
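    What the post proposes, sketched (assuming PyTorch; layer sizes are placeholders): one CNN torso shared by the actor and critic, with separate linear heads on top.

        import torch.nn as nn

        class SharedEncoderActorCritic(nn.Module):
            def __init__(self, n_actions):
                super().__init__()
                self.encoder = nn.Sequential(  # shared CNN feature extractor
                    nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                    nn.Flatten(),
                )
                self.actor = nn.LazyLinear(n_actions)  # separate heads
                self.critic = nn.LazyLinear(1)

            def forward(self, obs):
                h = self.encoder(obs)
                return self.actor(h), self.critic(h)

    Whether sharing helps in practice is algorithm-dependent: with shared layers, the critic's gradients also shape the actor's features, which can either stabilize learning or interfere with it.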
    Changing the learning rate for a DQN agent in Stable Baselines 3
    Hi guys, I am pretty new to the field of reinforcement learning and I am going through it every day. I made a custom environment with one state and 3 actions. I am trying to solve it with DQN, but it is taking too much time, for example around 600k timesteps to get to the max result. So I thought of changing the learning rate. I could change the learning rate for PPO, but it doesn't seem to be working for DQN. It would be really great if someone could help. Thanks in advance. submitted by /u/last_2_brain_cells97  ( 1 min )
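    For reference, a sketch of what I believe SB3 supports: algorithm constructors accept learning_rate either as a float or as a schedule called with the remaining training progress (going from 1 down to 0).

        from stable_baselines3 import DQN

        # Linearly decay the learning rate from 1e-3 to 0 over training.
        model = DQN("MlpPolicy", "CartPole-v1",
                    learning_rate=lambda progress_remaining: 1e-3 * progress_remaining)
        model.learn(total_timesteps=10_000)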
    Logarithmic spiral
    I’ve seen an image similar to the following many times, but I don’t recall any source going into detail regarding how the spiral is constructed. This post will do just that. The previous post constructed iterated golden rectangles. We start with a golden rectangle and imagine chopping off first the blue, then the green, then […] Logarithmic spiral first appeared on John D. Cook.  ( 2 min )
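    For reference (my sketch, not the post's code): the spiral in question is r = a·e^(bθ); choosing b = ln(φ)/(π/2) makes the radius grow by a factor of the golden ratio φ every quarter turn, which is what lets it thread through the iterated golden rectangles.

        import math

        phi = (1 + 5 ** 0.5) / 2
        b = math.log(phi) / (math.pi / 2)  # growth rate tuned to the golden ratio

        for k in range(5):
            theta = k * math.pi / 2
            print(f"theta = {k}*pi/2, r = {math.exp(b * theta):.4f}")  # 1, phi, phi^2, ...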
    Iterated golden rectangles in detail
    I’ve seen the illustration of nesting golden rectangles many times, but I’ve never seen a presentation go into much detail. This post will go into more detail than usual, including Python code. Start with a golden rectangle in landscape mode. We’ll plot our rectangle with the lower left corner at the origin and the upper […] Iterated golden rectangles in detail first appeared on John D. Cook.  ( 2 min )
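    The iteration itself fits in a few lines (a sketch in the spirit of the post, not its actual code): slice a square off a φ-by-1 rectangle, and the remainder is again a golden rectangle, scaled by 1/φ and rotated a quarter turn.

        phi = (1 + 5 ** 0.5) / 2

        w, h = phi, 1.0
        for i in range(6):
            print(f"step {i}: {w:.4f} x {h:.4f}, aspect ratio {w / h:.4f}")
            w, h = h, w - h  # cut off an h-by-h square, then rotate

        # Every aspect ratio printed equals phi, since phi - 1 = 1/phi.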
    Utilize AWS AI services to automate content moderation and compliance
    The daily volume of third-party and user-generated content (UGC) across industries is increasing exponentially. Startups, social media, gaming, and other industries must ensure their customers are protected, while keeping operational costs down. Businesses in the broadcasting and media industries often find it difficult to efficiently add ratings to content pieces and formats to comply with […]  ( 6 min )
    Content moderation design patterns with AWS managed AI services
    User-generated content (UGC) grows exponentially, as do the requirements and the cost of keeping content and online communities safe and compliant. Modern web and mobile platforms, from startups to large organizations, fuel businesses and drive user engagement through social features. Online community members expect safe and inclusive experiences where they can freely consume and […]  ( 8 min )
    More Freedom on the Freeway: AI Lifts Malaysia’s Toll Barriers
    Working as an aerospace engineer in Malaysia, Chee How Lim dreamed of building a startup that could really take off. Today his company, Tapway, is riding a wave of computer vision and AI adoption in Southeast Asia. A call for help in 2019 with video analytics led to the Kuala Lumpur-based company’s biggest project to […] The post More Freedom on the Freeway: AI Lifts Malaysia’s Toll Barriers appeared first on NVIDIA Blog.  ( 3 min )
    Q&A: Chris Rackauckas on the equations at the heart of practically everything
    Have a question about numerical differential equations? Odds are this CSAIL research affiliate has already addressed it.  ( 6 min )
    Should the definition of Digital twins include simulation of complex systems?
    In the previous post, we discussed the various definitions of digital twins, and we saw that there is no shortage of them! But here is a question we discussed in class: should the definition of digital twins include simulation of complex systems? In my opinion, simulation is the raison d’être for digital twins; let me explain (some of the ideas… Read More »Should the definition of Digital twins include simulation of complex systems? The post Should the definition of Digital twins include simulation of complex systems? appeared first on Data Science Central.  ( 2 min )
    Static Analyzers in Python
    Static analyzers are tools that help you check your code without really running it. The most basic form of static analyzer is the syntax highlighter in your favorite editor. If you need to compile your code (say, in C++), your compiler, such as LLVM, may also provide some static analyzer functions to warn you […] The post Static Analyzers in Python appeared first on Machine Learning Mastery.  ( 12 min )
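    A toy static analyzer (my sketch, not from the article) shows the core idea: inspect a file's AST without executing it, here flagging imports that are never used.

        import ast
        import sys

        source = open(sys.argv[1]).read()
        tree = ast.parse(source)  # parse only -- the code is never run

        imported = {alias.asname or alias.name.split(".")[0]
                    for node in ast.walk(tree)
                    if isinstance(node, (ast.Import, ast.ImportFrom))
                    for alias in node.names}
        used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

        for name in sorted(imported - used):
            print(f"unused import: {name}")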
    Disentangled and Side-aware Unsupervised Domain Adaptation for Cross-dataset Subjective Tinnitus Diagnosis. (arXiv:2205.03230v1 [eess.SP])
    EEG-based tinnitus classification is a valuable tool for tinnitus diagnosis, research, and treatment. Most current works are limited to a single dataset where data patterns are similar. But EEG signals are highly non-stationary, resulting in poor model generalization to new users, sessions, or datasets. Thus, designing a model that can generalize to new datasets is beneficial and indispensable. To mitigate distribution discrepancy across datasets, we propose to achieve Disentangled and Side-aware Unsupervised Domain Adaptation (DSUDA) for cross-dataset tinnitus diagnosis. A disentangled auto-encoder is developed to decouple class-irrelevant information from the EEG signals to improve the classifying ability. The side-aware unsupervised domain adaptation module adapts the class-irrelevant information as domain variance to a new dataset and excludes the variance to obtain the class-distill features for the new dataset classification. It also aligns signals of the left and right ears to overcome inherent EEG pattern differences. We compare DSUDA with state-of-the-art methods, and our model achieves significant improvements over competitors regarding comprehensive evaluation criteria. The results demonstrate our model can successfully generalize to a new dataset and effectively diagnose tinnitus.
    Predicting Loose-Fitting Garment Deformations Using Bone-Driven Motion Networks. (arXiv:2205.01355v2 [cs.GR] UPDATED)
    We present a learning algorithm that uses bone-driven motion networks to predict the deformation of loose-fitting garment meshes at interactive rates. Given a garment, we generate a simulation database and extract virtual bones from simulated mesh sequences using skin decomposition. At runtime, we separately compute low- and high-frequency deformations in a sequential manner. The low-frequency deformations are predicted by transferring body motions to virtual bones' motions, and the high-frequency deformations are estimated leveraging the global information of virtual bones' motions and local information extracted from low-frequency meshes. In addition, our method can estimate garment deformations caused by variations of the simulation parameters (e.g., fabric's bending stiffness) using an RBF kernel ensembling trained networks for different sets of simulation parameters. Through extensive comparisons, we show that our method outperforms state-of-the-art methods in terms of prediction accuracy of mesh deformations by about 20% in RMSE and 10% in Hausdorff distance and STED. The code and data are available at https://github.com/non-void/VirtualBones.
    Solar: $L_0$ solution path averaging for fast and accurate variable selection in high-dimensional data. (arXiv:2007.15707v3 [stat.ML] UPDATED)
    We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar re-constructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving variable selection stability, efficiency, and accuracy. We prove that: (i) with a high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample size. We also demonstrate that: (i) solar yields, with less than $1/3$ of the lasso computation load, substantial improvements over lasso in terms of the sparsity (64-84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selection of redundant variables and rejection of informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserves residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) given the computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than the theoretical maximum speedup for parallelized bootstrap lasso (confirmed by Amdahl's law).
    A Framework for Evaluating Post Hoc Feature-Additive Explainers. (arXiv:2106.08376v2 [cs.LG] UPDATED)
    Many applications of data-driven models demand transparency of decisions, especially in health care, criminal justice, and other high-stakes environments. Modern trends in machine learning research have led to algorithms that are increasingly intricate to the degree that they are considered to be black boxes. In an effort to reduce the opacity of decisions, methods have been proposed to construe the inner workings of such models in a human-comprehensible manner. These post hoc techniques are described as being universal explainers - capable of faithfully augmenting decisions with algorithmic insight. Unfortunately, there is little agreement about what constitutes a "good" explanation. Moreover, current methods of explanation evaluation are derived from either subjective or proxy means. In this work, we propose a framework for the evaluation of post hoc explainers on ground truth that is directly derived from the additive structure of a model. We demonstrate the efficacy of the framework in understanding explainers by evaluating popular explainers on thousands of synthetic and several real-world tasks. The framework unveils that explanations may be accurate but misattribute the importance of individual features.
    Efficient and passive learning of networked dynamical systems driven by non-white exogenous inputs. (arXiv:2110.00852v3 [cs.LG] UPDATED)
    We consider a networked linear dynamical system with $p$ agents/nodes. We study the problem of learning the underlying graph of interactions/dependencies from observations of the nodal trajectories over a time-interval $T$. We present a regularized non-causal consistent estimator for this problem and analyze its sample complexity over two regimes: (a) where the interval $T$ consists of $n$ i.i.d. observation windows of length $T/n$ (restart and record), and (b) where $T$ is one continuous observation window (consecutive). Using the theory of $M$-estimators, we show that the estimator recovers the underlying interactions, in either regime, in a time-interval that is logarithmic in the system size $p$. To the best of our knowledge, this is the first work to analyze the sample complexity of learning linear dynamical systems \emph{driven by unobserved not-white wide-sense stationary (WSS) inputs}.  ( 2 min )
    Learning Optimal Conformal Classifiers. (arXiv:2110.09192v3 [cs.LG] UPDATED)
    Modern deep learning based classifiers show very high accuracy on test data, but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows us to "shape" the confidence sets predicted at test time, which is difficult for standard CP. In experiments on several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.  ( 2 min )
    Differentially private training of residual networks with scale normalisation. (arXiv:2203.00324v2 [cs.LG] UPDATED)
    The training of neural networks with Differentially Private Stochastic Gradient Descent offers formal Differential Privacy guarantees but introduces accuracy trade-offs. In this work, we propose to alleviate these trade-offs in residual networks with Group Normalisation through a simple architectural modification termed ScaleNorm, by which an additional normalisation layer is introduced after the residual block's addition operation. Our method allows us to further improve on the recently reported state-of-the-art on CIFAR-10, achieving a top-1 accuracy of 82.5% ({\epsilon}=8.0) when trained from scratch.  ( 2 min )
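    My reading of the abstract, as a sketch (not the authors' code): the extra normalisation layer sits after the residual addition rather than inside the block.

        import torch.nn as nn

        class ScaleNormResidualBlock(nn.Module):
            def __init__(self, channels, groups=8):
                super().__init__()
                self.body = nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.GroupNorm(groups, channels), nn.ReLU(),
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.GroupNorm(groups, channels),
                )
                self.scale_norm = nn.GroupNorm(groups, channels)  # the added layer

            def forward(self, x):
                return self.scale_norm(x + self.body(x))  # normalise after the addition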
    Active Offline Policy Selection. (arXiv:2106.10251v4 [cs.LG] UPDATED)
    This paper addresses the problem of policy selection in domains with abundant logged data, but with a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between the evaluation by OPE and the full online evaluation. Yet, large amounts of online interactions are often not possible in practice. To overcome this problem, we introduce active offline policy selection - a novel sequential decision approach that combines logged data with online interaction to identify the best policy. We use OPE estimates to warm start the online evaluation. Then, in order to utilize the limited environment interactions wisely we decide which policy to evaluate next based on a Bayesian optimization method with a kernel that represents policy similarity. We use multiple benchmarks, including real-world robotics, with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.  ( 2 min )
    What Makes A Good Fisherman? Linear Regression under Self-Selection Bias. (arXiv:2205.03246v1 [math.ST])
    In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.  ( 2 min )
    Federated Channel Learning for Intelligent Reflecting Surfaces With Fewer Pilot Signals. (arXiv:2205.03196v1 [eess.SP])
    Channel estimation is a critical task in intelligent reflecting surface (IRS)-assisted wireless systems due to the uncertainties imposed by environment dynamics and rapid changes in the IRS configuration. To deal with these uncertainties, deep learning (DL) approaches have been proposed. Previous works consider centralized learning (CL) approach for model training, which entails the collection of the whole training dataset from the users at the base station (BS), hence introducing huge transmission overhead for data collection. To address this challenge, this paper proposes a federated learning (FL) framework to jointly estimate both direct and cascaded channels in IRS-assisted wireless systems. We design a single convolutional neural network trained on the local datasets of the users without sending them to the BS. We show that the proposed FL-based channel estimation approach requires approximately 60% fewer pilot signals and it exhibits 12 times lower transmission overhead than CL, while maintaining satisfactory performance close to CL. In addition, it provides lower estimation error than the state-of-the-art DL-based schemes.  ( 2 min )
    Electrocardiographic Deep Learning for Predicting Post-Procedural Mortality. (arXiv:2205.03242v1 [eess.SP])
    Background. Pre-operative risk assessments used in clinical practice are limited in their ability to identify risk for post-operative mortality. We hypothesize that electrocardiograms contain hidden risk markers that can help prognosticate post-operative mortality. Methods. In a derivation cohort of 45,969 pre-operative patients (age 59 ± 19 years, 55 percent women), a deep learning algorithm was developed to leverage waveform signals from pre-operative ECGs to discriminate post-operative mortality. Model performance was assessed in a holdout internal test dataset and in two external hospital cohorts and compared with the Revised Cardiac Risk Index (RCRI) score. Results. In the derivation cohort, there were 1,452 deaths. The algorithm discriminates mortality with an AUC of 0.83 (95% CI 0.79-0.87), surpassing the discrimination of the RCRI score, which had an AUC of 0.67 (CI 0.61-0.72), in the held-out test cohort. Patients determined to be high risk by the deep learning model's risk prediction had an unadjusted odds ratio (OR) of 8.83 (5.57-13.20) for post-operative mortality, as compared to an unadjusted OR of 2.08 (CI 0.77-3.50) for post-operative mortality for RCRI greater than 2. The deep learning algorithm performed similarly for patients undergoing cardiac surgery with an AUC of 0.85 (CI 0.77-0.92), non-cardiac surgery with an AUC of 0.83 (0.79-0.88), and catheterization or endoscopy suite procedures with an AUC of 0.76 (0.72-0.81). The algorithm similarly discriminated risk for mortality in two separate external validation cohorts from independent healthcare systems, with AUCs of 0.79 (0.75-0.83) and 0.75 (0.74-0.76) respectively. Conclusion. The findings demonstrate how a novel deep learning algorithm, applied to pre-operative ECGs, can improve discrimination of post-operative mortality.
    Convex Analysis at Infinity: An Introduction to Astral Space. (arXiv:2205.03260v1 [math.OC])
    Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity. In this work, we aim to develop a theory for understanding such minimizers at infinity. We study astral space, a compact extension of $\mathbb{R}^n$ to which such points at infinity have been added. Astral space is constructed to be as small as possible while still ensuring that all linear functions can be continuously extended to the new space. Although astral space includes all of $\mathbb{R}^n$, it is not a vector space, nor even a metric space. However, it is sufficiently well-structured to allow useful and meaningful extensions of concepts of convexity, conjugacy, and subdifferentials. We develop these concepts and analyze various properties of convex functions on astral space, including the detailed structure of their minimizers, exact characterizations of continuity, and convergence of descent algorithms.  ( 2 min )
    Implementation of a Binary Neural Network on a Passive Array of Magnetic Tunnel Junctions. (arXiv:2112.09159v2 [cs.ET] UPDATED)
    The increasing scale of neural networks and their growing application space have produced demand for more energy- and memory-efficient artificial-intelligence-specific hardware. Avenues to mitigate the main issue, the von Neumann bottleneck, include in-memory and near-memory architectures, as well as algorithmic approaches. Here we leverage the low-power and the inherently binary operation of magnetic tunnel junctions (MTJs) to demonstrate neural network hardware inference based on passive arrays of MTJs. In general, transferring a trained network model to hardware for inference is confronted by degradation in performance due to device-to-device variations, write errors, parasitic resistance, and nonidealities in the substrate. To quantify the effect of these hardware realities, we benchmark 300 unique weight matrix solutions of a 2-layer perceptron to classify the Wine dataset for both classification accuracy and write fidelity. Despite device imperfections, we achieve software-equivalent accuracy of up to 95.3 % with proper tuning of network parameters in 15 x 15 MTJ arrays having a range of device sizes. The success of this tuning process shows that new metrics are needed to characterize the performance and quality of networks reproduced in mixed signal hardware.  ( 2 min )
    Fine-tuning wav2vec2 for speaker recognition. (arXiv:2109.15053v2 [cs.SD] UPDATED)
    This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.  ( 2 min )
    Incremental Data-Uploading for Full-Quantum Classification. (arXiv:2205.03057v1 [quant-ph])
    The data representation in a machine-learning model strongly influences its performance. This becomes even more important for quantum machine learning models implemented on noisy intermediate scale quantum (NISQ) devices. Encoding high dimensional data into a quantum circuit for a NISQ device without any loss of information is not trivial and brings a lot of challenges. While simple encoding schemes (like single qubit rotational gates to encode high dimensional data) often lead to information loss within the circuit, complex encoding schemes with entanglement and data re-uploading lead to an increase in the encoding gate count. This is not well-suited for NISQ devices. This work proposes 'incremental data-uploading', a novel encoding pattern for high dimensional data that tackles these challenges. We spread the encoding gates for the feature vector of a given data point throughout the quantum circuit with parameterized gates in between them. This encoding pattern results in a better representation of data in the quantum circuit with a minimal pre-processing requirement. We show the efficiency of our encoding pattern on a classification task using the MNIST and Fashion-MNIST datasets, and compare different encoding methods via classification accuracy and the effective dimension of the model.  ( 2 min )
    Gaussian Processes for Missing Value Imputation. (arXiv:2204.04648v2 [stat.ML] UPDATED)
    Missing values are common in many real-life datasets. However, most of the current machine learning methods can not handle missing values. This means that they should be imputed beforehand. Gaussian Processes (GPs) are non-parametric models with accurate uncertainty estimates that combined with sparse approximations and stochastic variational inference scale to large data sets. Sparse GPs can be used to compute a predictive distribution for missing data. Here, we present a hierarchical composition of sparse GPs that is used to predict missing values at each dimension using all the variables from the other dimensions. We call the approach missing GP (MGP). MGP can be trained simultaneously to impute all observed missing values. Specifically, it outputs a predictive distribution for each missing value that is then used in the imputation of other missing values. We evaluate MGP in one private clinical data set and four UCI datasets with a different percentage of missing values. We compare the performance of MGP with other state-of-the-art methods for imputing missing values, including variants based on sparse GPs and deep GPs. The results obtained show a significantly better performance of MGP.  ( 2 min )
    Defending against Reconstruction Attacks through Differentially Private Federated Learning for Classification of Heterogeneous Chest X-Ray Data. (arXiv:2205.03168v1 [cs.LG])
    Privacy regulations and the physical distribution of heterogeneous data are often primary concerns for the development of deep learning models in a medical context. This paper evaluates the feasibility of differentially private federated learning for chest X-ray classification as a defense against privacy attacks on DenseNet121 and ResNet50 network architectures. We simulated a federated environment by distributing images from the public CheXpert and Mendeley chest X-ray datasets unevenly among 36 clients. Both non-private baseline models achieved an area under the ROC curve (AUC) of 0.94 on the binary classification task of detecting the presence of a medical finding. We demonstrate that both model architectures are vulnerable to privacy violation by applying image reconstruction attacks to local model updates from individual clients. The attack was particularly successful during later training stages. To mitigate the risk of privacy breach, we integrated R\'enyi differential privacy with a Gaussian noise mechanism into local model training. We evaluate model performance and attack vulnerability for privacy budgets $\epsilon \in$ {1, 3, 6, 10}. The DenseNet121 achieved the best utility-privacy trade-off with an AUC of 0.94 for $\epsilon$ = 6. Model performance deteriorated slightly for individual clients compared to the non-private baseline. The ResNet50 only reached an AUC of 0.76 in the same privacy setting. Its performance was inferior to that of the DenseNet121 for all considered privacy constraints, suggesting that the DenseNet121 architecture is more robust to differentially private training.  ( 2 min )
    Side-aware Meta-Learning for Cross-Dataset Listener Diagnosis with Subjective Tinnitus. (arXiv:2205.03231v1 [eess.SP])
    With the development of digital technology, machine learning has paved the way for the next generation of tinnitus diagnoses. Although machine learning has been widely applied in EEG-based tinnitus analysis, most current models are dataset-specific. Each dataset may be limited to a specific range of symptoms, overall disease severity, and demographic attributes; further, dataset formats may differ, impacting model performance. This paper proposes a side-aware meta-learning for cross-dataset tinnitus diagnosis, which can effectively classify tinnitus in subjects of divergent ages and genders from different data collection processes. Owing to the superiority of meta-learning, our method does not rely on large-scale datasets like conventional deep learning models. Moreover, we design a subject-specific training process to assist the model in fitting the data pattern of different patients or healthy people. Our method achieves a high accuracy of 73.8\% in the cross-dataset classification. We conduct an extensive analysis to show the effectiveness of side information of ears in enhancing model performance and side-aware meta-learning in improving the quality of the learned features.
    IMU Based Deep Stride Length Estimation With Self-Supervised Learning. (arXiv:2205.02977v1 [cs.LG])
    Stride length estimation using inertial measurement unit (IMU) sensors has recently become popular, as stride length is one representative gait parameter for health care and sports training. The traditional estimation method requires explicit calibrations and design assumptions. Current deep learning methods suffer from the problem of having few labeled data. To solve the above problems, this paper proposes a single convolutional neural network (CNN) model to predict the stride length of running and walking and to classify the running or walking type per stride. The model trains its pretext task with self-supervised learning on a large unlabeled dataset for feature learning, and its downstream task on the stride length estimation and classification tasks with supervised learning on a small labeled dataset. The proposed model achieves a better average percent error of 4.78\% on running and walking stride length regression, compared to 7.44\% for the previous approach, and 99.83\% accuracy on running and walking classification.  ( 2 min )
    Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation. (arXiv:2205.03195v1 [cs.LG])
    Simulation is a crucial tool for accelerating the development of autonomous vehicles. Making simulation realistic requires models of the human road users who interact with such cars. Such models can be obtained by applying learning from demonstration (LfD) to trajectories observed by cars already on the road. However, existing LfD methods are typically insufficient, yielding policies that frequently collide or drive off the road. To address this problem, we propose Symphony, which greatly improves realism by combining conventional policies with a parallel beam search. The beam search refines these policies on the fly by pruning branches that are unfavourably evaluated by a discriminator. However, it can also harm diversity, i.e., how well the agents cover the entire distribution of realistic behaviour, as pruning can encourage mode collapse. Symphony addresses this issue with a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. The use of such goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search. Experiments on both proprietary and open Waymo datasets confirm that Symphony agents learn more realistic and diverse behaviour than several baselines.  ( 2 min )
    Trainable Wavelet Neural Network for Non-Stationary Signals. (arXiv:2205.03355v1 [cs.LG])
    This work introduces a wavelet neural network to learn a filter-bank specialized to fit non-stationary signals and improve interpretability and performance for digital signal processing. The network uses a wavelet transform as the first layer of a neural network where the convolution is a parameterized function of the complex Morlet wavelet. Experimental results, on both simplified data and atmospheric gravity waves, show the network is quick to converge, generalizes well on noisy data, and outperforms standard network architectures.
    Meta-Knowledge Transfer for Inductive Knowledge Graph Embedding. (arXiv:2110.14170v3 [cs.LG] UPDATED)
    Knowledge graphs (KGs) consisting of a large number of triples have become widespread recently, and many knowledge graph embedding (KGE) methods are proposed to embed entities and relations of a KG into continuous vector spaces. Such embedding methods simplify the operations of conducting various in-KG tasks (e.g., link prediction) and out-of-KG tasks (e.g., question answering). They can be viewed as general solutions for representing KGs. However, existing KGE methods are not applicable to inductive settings, where a model trained on source KGs will be tested on target KGs with entities unseen during model training. Existing works focusing on KGs in inductive settings can only solve the inductive relation prediction task. They can not handle other out-of-KG tasks as general as KGE methods since they don't produce embeddings for entities. In this paper, to achieve inductive knowledge graph embedding, we propose a model MorsE, which does not learn embeddings for entities but learns transferable meta-knowledge that can be used to produce entity embeddings. Such meta-knowledge is modeled by entity-independent modules and learned by meta-learning. Experimental results show that our model significantly outperforms corresponding baselines for in-KG and out-of-KG tasks in inductive settings.  ( 2 min )
    Synthetic Data -- what, why and how?. (arXiv:2205.03257v1 [cs.LG])
    This explainer document aims to provide an overview of the current state of the rapidly expanding work on synthetic data technologies, with a particular focus on privacy. The article is intended for a non-technical audience, though some formal definitions have been given to provide clarity to specialists. This article is intended to enable the reader to quickly become familiar with the notion of synthetic data, as well as understand some of the subtle intricacies that come with it. We do believe that synthetic data is a very useful tool, and our hope is that this report highlights that, while drawing attention to nuances that can easily be overlooked in its deployment.  ( 2 min )
    Application of Clustering Algorithms for Dimensionality Reduction in Infrastructure Resilience Prediction Models. (arXiv:2205.03316v1 [cs.LG])
    Recent studies increasingly adopt simulation-based machine learning (ML) models to analyze critical infrastructure system resilience. For realistic applications, these ML models consider the component-level characteristics that influence the network response during emergencies. However, such an approach could result in a large number of features and cause ML models to suffer from the `curse of dimensionality'. We present a clustering-based method that simultaneously minimizes the problem of high-dimensionality and improves the prediction accuracy of ML models developed for resilience analysis in large-scale interdependent infrastructure networks. The methodology has three parts: (a) generation of simulation dataset, (b) network component clustering, and (c) dimensionality reduction and development of prediction models. First, an interdependent infrastructure simulation model simulates the network-wide consequences of various disruptive events. The component-level features are extracted from the simulated data. Next, clustering algorithms are used to derive the cluster-level features by grouping component-level features based on their topological and functional characteristics. Finally, ML algorithms are used to develop models that predict the network-wide impacts of disruptive events using the cluster-level features. The applicability of the method is demonstrated using an interdependent power-water-transport testbed. The proposed method can be used to develop decision-support tools for post-disaster recovery of infrastructure networks.
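    Schematically (a sketch with synthetic data, assuming scikit-learn; not the paper's code), the three parts map to: simulate, cluster component-level features, then fit a predictor on the resulting cluster-level features.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 500))  # 200 simulated events x 500 component features
        y = rng.normal(size=200)         # simulated network-wide impact

        # Group similar components, then average features within each cluster.
        labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X.T)
        X_c = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(20)])

        model = RandomForestRegressor(random_state=0).fit(X_c, y)  # far fewer features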
    Physics-informed neural networks for PDE-constrained optimization and control. (arXiv:2205.03377v1 [cs.LG])
    A fundamental problem of science is designing optimal control policies that manipulate a given environment into producing a desired outcome. Control Physics-Informed Neural Networks simultaneously solve a given system state, and its respective optimal control, in a one-stage framework that conforms to physical laws of the system. Prior approaches use a two-stage framework that models and controls a system sequentially, whereas Control PINNs incorporates the required optimality conditions in its architecture and loss function. The success of Control PINNs is demonstrated by solving the following open-loop optimal control problems: (i) an analytical problem (ii) a one-dimensional heat equation, and (iii) a two-dimensional predator-prey problem.  ( 2 min )
    Generative Adversarial Neural Operators. (arXiv:2205.03017v1 [cs.LG])
    We propose the generative adversarial neural operator (GANO), a generative model paradigm for learning probabilities on infinite-dimensional function spaces. The natural sciences and engineering are known to have many types of data that are sampled from infinite-dimensional function spaces, where classical finite-dimensional deep generative adversarial networks (GANs) may not be directly applicable. GANO generalizes the GAN framework and allows for the sampling of functions by learning push-forward operator maps in infinite-dimensional spaces. GANO consists of two main components, a generator neural operator and a discriminator neural functional. The inputs to the generator are samples of functions from a user-specified probability measure, e.g., Gaussian random field (GRF), and the generator outputs are synthetic data functions. The input to the discriminator is either a real or synthetic data function. In this work, we instantiate GANO using the Wasserstein criterion and show how the Wasserstein loss can be computed in infinite-dimensional spaces. We empirically study GANOs in controlled cases where both input and output functions are samples from GRFs and compare its performance to the finite-dimensional counterpart GAN. We empirically study the efficacy of GANO on real-world function data of volcanic activities and show its superior performance over GAN. Furthermore, we find that for the function-based data considered, GANOs are more stable to train than GANs and require less hyperparameter optimization.
    Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation. (arXiv:2205.01133v2 [cs.CL] UPDATED)
    Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as a valuable piece of context information to decrease the ambiguity of input sentences. Despite the increasing popularity of such a technique, good and sizeable datasets are scarce, limiting the full extent of their potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite a large number of speakers, the Hausa language is considered low-resource in natural language processing (NLP). This is due to the absence of sufficient resources to implement most NLP tasks. While some datasets exist, they are either scarce, machine-generated, or in the religious domain. Therefore, there is a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. To prepare the dataset, we started by translating the English description of the images in the Hindi Visual Genome (HVG) into Hausa automatically. Afterward, the synthetic Hausa data was carefully post-edited considering the respective images. The dataset comprises 32,923 images and their descriptions that are divided into training, development, test, and challenge test set. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.  ( 3 min )
    Atlas-powered deep learning (ADL) -- application to diffusion weighted MRI. (arXiv:2205.03210v1 [physics.med-ph])
    Deep learning has a great potential for estimating biomarkers in diffusion weighted magnetic resonance imaging (dMRI). Atlases, on the other hand, are a unique tool for modeling the spatio-temporal variability of biomarkers. In this paper, we propose the first framework to exploit both deep learning and atlases for biomarker estimation in dMRI. Our framework relies on non-linear diffusion tensor registration to compute biomarker atlases and to estimate atlas reliability maps. We also use nonlinear tensor registration to align the atlas to a subject and to estimate the error of this alignment. We use the biomarker atlas, atlas reliability map, and alignment error map, in addition to the dMRI signal, as inputs to a deep learning model for biomarker estimation. We use our framework to estimate fractional anisotropy and neurite orientation dispersion from down-sampled dMRI data on a test cohort of 70 newborn subjects. Results show that our method significantly outperforms standard estimation methods as well as recent deep learning techniques. Our method is also more robust to stronger measurement down-sampling factors. Our study shows that the advantages of deep learning and atlases can be synergistically combined to achieve unprecedented accuracy in biomarker estimation from dMRI data.
    Efficient Minimax Optimal Estimators For Multivariate Convex Regression. (arXiv:2205.03368v1 [math.ST])
    We study the computational aspects of the task of multivariate convex regression in dimension $d \geq 5$. We present the first computationally efficient minimax optimal (up to logarithmic factors) estimators for the tasks of (i) $L$-Lipschitz convex regression (ii) $\Gamma$-bounded convex regression under polytopal support. The proof of the correctness of these estimators uses a variety of tools from different disciplines, among them empirical process theory, stochastic geometry, and potential theory. This work is the first to show the existence of efficient minimax optimal estimators for non-Donsker classes whose corresponding Least Squares Estimators are provably minimax sub-optimal, a result of independent interest.  ( 2 min )
    Optimal Control as Variational Inference. (arXiv:2205.03279v1 [cs.LG])
    In this article we address the stochastic and risk-sensitive optimal control problem probabilistically, and decompose and solve the probabilistic models using principles from variational inference. We demonstrate how this culminates in two separate probabilistic inference procedures that allow the deterministic optimal policy to be inferred iteratively. More formally, a sequence of belief policies, as a probabilistic proxy for the deterministic optimal policy, is specified through a fixed-point iteration with the equilibrium point coinciding with the deterministic solution. These results re-establish the paradigm of Control as Inference, a concept explored and exploited originally by the Reinforcement Learning community in anticipation of deep-rooted connections between optimal estimation and control. Although the Control as Inference paradigm has already resulted in the development of several Reinforcement Learning algorithms, until now the underlying mechanism was only partially understood. For that very reason, Control as Inference has not been well received by the control community. By exposing the underlying mechanism, we aim to contribute to its general acceptance as a framework superseding optimal control. To exhibit its general relevance, we discuss parallels with path integral control and a wide range of possible applications.  ( 2 min )
    A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition. (arXiv:2205.03027v1 [cs.LG])
    Despite the success of deep learning in speech recognition, multi-dialect speech recognition remains a difficult problem. Although dialect-specific acoustic models are known to perform well in general, they are not easy to maintain when dialect-specific data is scarce and the number of dialects for each language is large. Therefore, a single unified acoustic model (AM) that generalizes well for many dialects has been in demand. In this paper, we propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM. Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously. We also propose a simple but effective training method to deal with unseen dialects. The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.  ( 2 min )
    Predicting Chemical Hazard across Taxa through Machine Learning. (arXiv:2110.03688v3 [q-bio.QM] UPDATED)
    We applied machine learning methods to predict chemical hazards, focusing on fish acute toxicity across taxa. We analyzed the relevance of taxonomy and experimental setup, showing that taking them into account can lead to considerable improvements in classification performance. We quantified the gain obtained through the introduction of taxonomic and experimental information, compared to classification based on chemical information alone. We used our approach with standard machine learning models (K-nearest neighbors, random forests and deep neural networks), as well as the recently proposed Read-Across Structure Activity Relationship (RASAR) models, which were very successful in predicting chemical hazards to mammals based on chemical similarity. We were able to obtain accuracies of over 93% on datasets where, due to noise in the data, the maximum achievable accuracy was expected to be below 96%. The best performances were obtained by random forests and RASAR models. We analyzed metrics to compare our results with animal test reproducibility, and although most of our models "outperform animal test reproducibility" as measured through recently proposed metrics, we showed that the comparison between machine learning performance and animal test reproducibility should be addressed with particular care. While we focused on fish mortality, our approach, provided that the right data is available, is valid for any combination of chemicals, effects and taxa.  ( 2 min )
    Altering backward pass gradients improves convergence. (arXiv:2111.12495v2 [cs.LG] UPDATED)
    In typical neural network training, the gradients in the backward pass are determined by the forward pass, so the two stages are coupled. However, neural networks often perform worse when gradients explode or vanish. To address this, numerous approaches like Gradient Clipping (GC) and Adaptive Gradient Clipping (AGC) have been developed to improve the gradient behaviour of networks without normalization layers during backward passes. These techniques decouple the backward and forward passes and modify the gradients adaptively. A possible drawback of clipping approaches is that they must be computed for each weight tensor in each layer. We offer the PowerGrad Transform (PGT), a comparable approach that alters and enhances the gradient flow in the backward pass but is computed only at the final softmax layer. It is very computationally efficient and outperforms both GC and AGC, resulting in improved performance in networks without batch normalization. PGT is easy to integrate into existing networks, requiring just a few lines of code, and significantly increases performance in non-BN ResNets. The impact is more pronounced on large datasets such as ImageNet, where networks do not fit all of the training data and there is some training headroom. PGT makes it possible for the network to better fit the training data while simultaneously improving its performance on the test set.  ( 2 min )
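    Because PGT only intervenes at the final softmax layer, the general idea can be illustrated with a small custom autograd function. The sketch below is a hypothetical rendering, not the paper's published transform: the exponent gamma, the renormalization, and the assumed form of the altered gradient are all illustrative choices.

```python
import torch
import torch.nn.functional as F

class PowerSoftmaxCE(torch.autograd.Function):
    """Cross-entropy whose backward pass reshapes the softmax gradient
    (assumed form: probabilities raised to a power and renormalized)."""

    @staticmethod
    def forward(ctx, logits, targets, gamma):
        probs = torch.softmax(logits, dim=1)
        ctx.save_for_backward(probs, targets)
        ctx.gamma = gamma
        return F.nll_loss(torch.log(probs + 1e-12), targets)

    @staticmethod
    def backward(ctx, grad_out):
        probs, targets = ctx.saved_tensors
        onehot = F.one_hot(targets, probs.shape[1]).float()
        # Standard CE gradient w.r.t. logits is (probs - onehot)/batch;
        # here the probabilities are power-transformed first.
        p_pow = probs.pow(ctx.gamma)
        p_pow = p_pow / p_pow.sum(dim=1, keepdim=True)
        grad_logits = (p_pow - onehot) / probs.shape[0]
        return grad_out * grad_logits, None, None
```

    A call like `loss = PowerSoftmaxCE.apply(logits, targets, 1.2)` then behaves like ordinary cross-entropy in the forward pass while altering only the backward flow, which is what keeps the overhead of this style of method negligible.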
    Longitudinal cardio-respiratory fitness prediction through free-living wearable sensors. (arXiv:2205.03116v1 [cs.LG])
    Cardiorespiratory fitness is an established predictor of metabolic disease and mortality. Fitness is directly measured as maximal oxygen consumption (VO2max), or indirectly assessed using heart rate response to a standard exercise test. However, such testing is costly and burdensome, limiting its utility and scalability. Fitness can also be approximated using resting heart rate and self-reported exercise habits, but with lower accuracy. Modern wearables capture dynamic heart rate data which, in combination with machine learning models, could improve fitness prediction. In this work, we analyze movement and heart rate signals from wearable sensors in free-living conditions from 11,059 participants who also underwent a standard exercise test, along with a longitudinal repeat cohort of 2,675 participants. We design algorithms and models that convert raw sensor data into cardio-respiratory fitness estimates, and validate these estimates' ability to capture fitness profiles in a longitudinal cohort over time while subjects engaged in real-world (non-exercise) behaviour. Additionally, we validate our methods with a third external cohort of 181 participants who underwent maximal VO2max testing, which is considered the gold-standard measurement because it requires reaching one's maximum heart rate and exhaustion level. Our results show that the developed models yield a high correlation (r = 0.82, 95% CI 0.80-0.83) when compared to the ground truth in a holdout sample. These models outperform conventional non-exercise fitness models and traditional bio-markers using measurements of normal daily living, without the need for a specific exercise test. Additionally, we show the adaptability and applicability of this approach for detecting fitness change over time in the longitudinal subsample whose measurements were repeated after 7 years.  ( 2 min )
    Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss. (arXiv:2111.06916v2 [cs.CL] UPDATED)
    Over the past decade, we have seen exponential growth in online content fueled by social media platforms. Data generation at this scale comes with the caveat of insurmountable offensive content in it. The complexity of identifying offensive content is exacerbated by the usage of multiple modalities (image, language, etc.), code-mixed language and more. Moreover, even after careful sampling and annotation of offensive content, there will always exist a significant class imbalance between offensive and non-offensive content. In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss which circumvents two challenges: (1) code-mixing in languages and (2) the class imbalance problem for Dravidian language offense detection. We also replace the conventional dot product-based classifier with a cosine-based classifier, which results in a boost in performance. Further, we use multilingual models that help transfer characteristics learnt across languages to work effectively with low-resource languages. It is also important to note that our model handles instances of mixed script (say, usage of Latin and Dravidian-Tamil script) as well. To summarize, our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual and code-mixed setting.  ( 2 min )
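    The two ingredients, a focal loss modulated by a per-sample CMI and a cosine classifier head, are easy to sketch. How exactly the CMI enters the focusing term is an assumption here; the snippet only illustrates the general construction.

```python
import torch
import torch.nn.functional as F

class CosineClassifier(torch.nn.Module):
    """Replaces the dot-product head: cosine similarity between
    L2-normalised features and class weights, scaled by a temperature."""
    def __init__(self, dim, n_classes, scale=16.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_classes, dim))
        self.scale = scale

    def forward(self, x):
        return self.scale * F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()

def cmi_focal_loss(logits, targets, cmi, gamma=2.0):
    """Focal loss whose focusing exponent grows with the sample's
    Code-Mixing Index (assumed coupling, for illustration only)."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # model probability of the true class
    return ((1.0 - pt) ** (gamma * (1.0 + cmi)) * ce).mean()
```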
    SKILL-IL: Disentangling Skill and Knowledge in Multitask Imitation Learning. (arXiv:2205.03130v1 [cs.LG])
    In this work, we introduce a new perspective for learning transferable content in multi-task imitation learning. Humans are able to transfer skills and knowledge: if we can cycle to work and drive to the store, we can also cycle to the store and drive to work. We take inspiration from this and hypothesize that the latent memory of a policy network can be disentangled into two partitions, containing either the knowledge of the environmental context for the task or the generalizable skill needed to solve the task. This allows improved training efficiency and better generalization over previously unseen combinations of skills in the same environment, and over the same task in unseen environments. We used the proposed approach to train a disentangled agent for two different multi-task IL environments. In both cases we outperformed the SOTA by 30% in task success rate. We also demonstrated this for navigation on a real robot.  ( 2 min )
    Echocardiography Segmentation with Enforced Temporal Consistency. (arXiv:2112.02102v2 [eess.IV] UPDATED)
    Convolutional neural networks (CNN) have demonstrated their ability to segment 2D cardiac ultrasound images. However, despite recent successes according to which the intra-observer variability on end-diastole and end-systole images has been reached, CNNs still struggle to leverage temporal information to provide accurate and temporally consistent segmentation maps across the whole cycle. Such consistency is required to accurately describe the cardiac function, a necessary step in diagnosing many cardiovascular diseases. In this paper, we propose a framework to learn the 2D+time apical long-axis cardiac shape such that the segmented sequences can benefit from temporal and anatomical consistency constraints. Our method is a post-processing step that takes as input segmented echocardiographic sequences produced by any state-of-the-art method and processes them in two steps to (i) identify spatio-temporal inconsistencies according to the overall dynamics of the cardiac sequence and (ii) correct the inconsistencies. The identification and correction of cardiac inconsistencies relies on a constrained autoencoder trained to learn a physiologically interpretable embedding of cardiac shapes, in which we can both detect and fix anomalies. We tested our framework on 98 full-cycle sequences from the CAMUS dataset, which are available alongside this paper. Our temporal regularization method not only improves the accuracy of the segmentation across whole sequences, but also enforces temporal and anatomical consistency.  ( 2 min )
    Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization. (arXiv:2205.02976v1 [cs.LG])
    We extend the idea underlying the success of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for infinite-horizon Markov Decision Processes (MDP). The existing GS-PG method was designed to learn from complete episodes or process trajectories, which limits its applicability to low-data environments and online process control. In this paper, mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state-decision transitions generated under different behavioral policies. We propose a variance reduction experience replay (VRER) approach that can intelligently select and reuse the most relevant transition observations, improve the policy gradient estimation accuracy, and accelerate the learning of the optimal policy. We then create a process control strategy by incorporating VRER with state-of-the-art step-based policy optimization approaches such as the actor-critic method and proximal policy optimization. The empirical study demonstrates that the proposed policy gradient methodology can significantly outperform existing policy optimization approaches.  ( 2 min )
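    The core of the MLR idea is a per-transition importance weight that divides the current policy's action probability by a mixture of the behavior policies that filled the replay buffer. A minimal sketch under the assumption of a uniform mixture:

```python
import numpy as np

def mlr_weights(logp_current, logp_behaviors):
    """Mixture likelihood ratios for reused transitions.

    logp_current: (N,) log-probabilities of the stored actions under the
        current policy.
    logp_behaviors: (K, N) log-probabilities under each of the K behavior
        policies that generated the buffer (assumed uniformly mixed).
    """
    K = logp_behaviors.shape[0]
    log_mix = np.logaddexp.reduce(logp_behaviors, axis=0) - np.log(K)
    return np.exp(logp_current - log_mix)

def reweighted_pg_estimate(returns, grad_logp, weights):
    # Policy-gradient estimate from reused transitions, each scaled by
    # its mixture likelihood ratio.
    return (weights[:, None] * returns[:, None] * grad_logp).mean(axis=0)
```

    Transitions with extreme weights can then be screened out before averaging, which is one plausible reading of the "intelligent selection" step that gives VRER its variance-reduction effect.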
    Transferring Adversarial Robustness Through Robust Representation Matching. (arXiv:2202.09994v2 [cs.LG] UPDATED)
    With the widespread use of machine learning, concerns over its security and reliability have become prevalent. As such, many have developed defenses to harden neural networks against adversarial examples, imperceptibly perturbed inputs that are reliably misclassified. Adversarial training in which adversarial examples are generated and used during training is one of the few known defenses able to reliably withstand such attacks against neural networks. However, adversarial training imposes a significant training overhead and scales poorly with model complexity and input dimension. In this paper, we propose Robust Representation Matching (RRM), a low-cost method to transfer the robustness of an adversarially trained model to a new model being trained for the same task irrespective of architectural differences. Inspired by student-teacher learning, our method introduces a novel training loss that encourages the student to learn the teacher's robust representations. Compared to prior works, RRM is superior with respect to both model performance and adversarial training time. On CIFAR-10, RRM trains a robust model $\sim 1.8\times$ faster than the state-of-the-art. Furthermore, RRM remains effective on higher-dimensional datasets. On Restricted-ImageNet, RRM trains a ResNet50 model $\sim 18\times$ faster than standard adversarial training.  ( 2 min )
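    Stripped to its essentials, student-teacher representation matching adds a penalty tying the student's intermediate features to a frozen robust teacher's. The sketch below assumes a simple normalised l2 penalty on penultimate features with weight `lam`; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def rrm_loss(student_logits, student_feats, teacher_feats, targets, lam=1.0):
    """Task loss plus a representation-matching term against a frozen,
    adversarially trained teacher (illustrative form)."""
    task = F.cross_entropy(student_logits, targets)
    match = F.mse_loss(F.normalize(student_feats, dim=1),
                       F.normalize(teacher_feats.detach(), dim=1))
    return task + lam * match
```

    Because no adversarial examples need to be generated for the student, each training step costs roughly as much as standard training, which is consistent with the reported speed-ups over full adversarial training.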
    Bayesian Sample Size Prediction for Online Activity. (arXiv:2111.12157v2 [stat.ML] UPDATED)
    In many contexts it is useful to predict the number of individuals in some population who will initiate a particular activity during a given period. For example, the number of users who will install a software update, the number of customers who will use a new feature on a website or who will participate in an A/B test. In practical settings, there is heterogeneity amongst individuals with regard to the distribution of time until they will initiate. For these reasons it is inappropriate to assume that the number of new individuals observed on successive days will be identically distributed. Given observations on the number of unique users participating in an initial period, we present a simple but novel Bayesian method for predicting the number of additional individuals who will participate during a subsequent period. We illustrate the performance of the method in predicting sample size in online experimentation.  ( 2 min )
    Journaling Data for Daily PHQ-2 Depression Prediction and Forecasting. (arXiv:2205.03391v1 [cs.LG])
    Digital health applications are becoming increasingly important for assessing and monitoring the wellbeing of people suffering from mental health conditions like depression. A common target of such applications is to predict the results of self-assessed Patient Health Questionnaires (PHQ), indicating the current symptom severity of depressive individuals. In this work, we explore the potential of using actively-collected data to predict and forecast daily PHQ-2 scores on a newly-collected longitudinal dataset. Using leave-one-subject-out cross-validation, we obtain a best MAE of 1.417 for daily prediction of PHQ-2 scores, which in the dataset used range from 0 to 12, as well as a best MAE of 1.914 for forecasting PHQ-2 scores using data from up to the last 7 days. This illustrates the additive value that can be obtained by incorporating actively-collected data in a depression monitoring application.  ( 2 min )
    Differentially Private Generalized Linear Models Revisited. (arXiv:2205.03014v1 [cs.LG])
    We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.  ( 2 min )
    Multichannel Synthetic Preictal EEG Signals to Enhance the Prediction of Epileptic Seizures. (arXiv:2205.03239v1 [eess.SP])
    Epilepsy is a chronic neurological disorder affecting 1% of people worldwide. Deep learning (DL) based electroencephalograph (EEG) analysis offers the possibility of accurate epileptic seizure (ES) prediction, thereby benefiting patients suffering from epilepsy. To identify the preictal region that precedes the onset of a seizure, a large number of annotated EEG signals are required to train DL algorithms. However, the scarcity of seizure onsets leads to a significant insufficiency of data for training the DL algorithms. To overcome this data insufficiency, in this paper we propose a preictal artificial signal synthesis algorithm based on a generative adversarial network to generate synthetic multichannel EEG preictal samples. A high-quality single-channel architecture, determined by visual and statistical evaluations, is used to train the generators of multichannel samples. The effectiveness of the synthetic samples is evaluated by comparing ES prediction performance with and without synthetic preictal sample augmentation. With 10$\times$ synthetic sample augmentation, the leave-one-seizure-out cross-validation ES prediction accuracy and the corresponding area under the receiver operating characteristic curve improve from 73.0% and 0.676 to 78.0% and 0.704, respectively. The obtained results indicate that synthetic preictal samples are effective for enhancing ES prediction performance.  ( 2 min )
    Over-the-Air Federated Multi-Task Learning Over MIMO Multiple Access Channels. (arXiv:2112.13603v2 [eess.SP] UPDATED)
    With the explosive growth of data and wireless devices, federated learning (FL) over wireless medium has emerged as a promising technology for large-scale distributed intelligent systems. Yet, the urgent demand for ubiquitous intelligence will generate a large number of concurrent FL tasks, which may seriously aggravate the scarcity of communication resources. By exploiting the analog superposition of electromagnetic waves, over-the-air computation (AirComp) is an appealing solution to alleviate the burden of communication required by FL. However, sharing frequency-time resources in over-the-air computation inevitably brings about the problem of inter-task interference, which poses a new challenge that needs to be appropriately addressed. In this paper, we study over-the-air federated multi-task learning (OA-FMTL) over the multiple-input multiple-output (MIMO) multiple access (MAC) channel. We propose a novel model aggregation method for the alignment of local gradients of different devices, which alleviates the straggler problem in over-the-air computation due to the channel heterogeneity. We establish a communication-learning analysis framework for the proposed OA-FMTL scheme by considering the spatial correlation between devices, and formulate an optimization problem for the design of transceiver beamforming and device selection. To solve this problem, we develop an algorithm by using alternating optimization (AO) and fractional programming (FP), which effectively mitigates the impact of inter-task interference on the FL learning performance. We show that due to the use of the new model aggregation method, device selection is no longer essential, thereby avoiding the heavy computational burden involved in selecting active devices. Numerical results demonstrate the validity of the analysis and the superb performance of the proposed scheme.  ( 2 min )
    Transferring Chemical and Energetic Knowledge Between Molecular Systems with Machine Learning. (arXiv:2205.03339v1 [physics.chem-ph])
    Predicting structural and energetic properties of a molecular system is one of the fundamental tasks in molecular simulations, with use cases in chemistry, biology, and medicine. In the past decade, the advent of machine learning algorithms has impacted molecular simulations for various tasks, including property prediction of atomistic systems. In this paper, we propose a novel methodology for transferring knowledge obtained from simple molecular systems to a more complex one, possessing a significantly larger number of atoms and degrees of freedom. In particular, we focus on the classification of high and low free-energy states. Our approach relies on (i) a novel hypergraph representation of molecules, encoding all relevant information for characterizing the potential energy of a conformation, and (ii) novel message passing and pooling layers for processing and making predictions on such hypergraph-structured data. Despite the complexity of the problem, our results show a remarkable AUC of 0.92 for transfer learning from tri-alanine to the deca-alanine system. Moreover, we show that the very same transfer learning approach can be used to group, in an unsupervised way, various secondary structures of deca-alanine into clusters having similar free-energy values. Our study represents a proof of concept that reliable transfer learning models for molecular systems can be designed, paving the way to unexplored routes in the prediction of structural and energetic properties of biologically relevant systems.  ( 2 min )
    Optimally tackling covariate shift in RKHS-based nonparametric regression. (arXiv:2205.02986v1 [math.ST])
    We study the covariate shift problem in the context of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We focus on two natural families of covariate shift problems defined using the likelihood ratios between the source and target distributions. When the likelihood ratios are uniformly bounded, we prove that the kernel ridge regression (KRR) estimator with a carefully chosen regularization parameter is minimax rate-optimal (up to a log factor) for a large family of RKHSs with regular kernel eigenvalues. Interestingly, KRR does not require full knowledge of the likelihood ratio apart from an upper bound on it. In striking contrast to the standard statistical setting without covariate shift, we also demonstrate that a naïve estimator, which minimizes the empirical risk over the function class, is strictly suboptimal under covariate shift as compared to KRR. We then address the larger class of covariate shift problems where the likelihood ratio is possibly unbounded yet has a finite second moment. Here, we show via careful simulations that KRR fails to attain the optimal rate. Instead, we propose a reweighted KRR estimator that weights samples based on a careful truncation of the likelihood ratios. Again, we are able to show that this estimator is minimax optimal, up to logarithmic factors.  ( 2 min )
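    A minimal sketch of the reweighted estimator: samples are weighted by the likelihood ratio between target and source densities, truncated at a fixed level. The truncation level and kernel below are illustrative choices; the paper's tuning of both is more careful.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def reweighted_krr(X, y, ratios, lam=1e-2, trunc=10.0):
    """KRR with truncated likelihood-ratio weights: solves the weighted,
    regularised normal equations (W K + n*lam*I) alpha = W y."""
    w = np.minimum(ratios, trunc)  # truncate possibly unbounded ratios
    n = len(y)
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(np.diag(w) @ K + n * lam * np.eye(n), w * y)
    return lambda Xq: rbf_kernel(Xq, X) @ alpha
```

    Setting `trunc=np.inf` recovers plain importance-weighted KRR, and constant ratios recover the standard KRR estimator.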
    Benchmarking Econometric and Machine Learning Methodologies in Nowcasting. (arXiv:2205.03318v1 [stat.ML])
    Nowcasting can play a key role in giving policymakers timelier insight into data published with a significant time lag, such as final GDP figures. Currently, there is a plethora of methodologies and approaches for practitioners to choose from. However, a comprehensive comparison of these disparate approaches in terms of predictive performance and characteristics has been lacking. This paper addresses that deficiency by examining the performance of 12 different methodologies in nowcasting US quarterly GDP growth, including all the methods most commonly employed in nowcasting, as well as some of the most popular traditional machine learning approaches. Performance was assessed on three different tumultuous periods in US economic history: the early 1980s recession, the 2008 financial crisis, and the COVID crisis. The two best performing methodologies in the analysis were long short-term memory artificial neural networks (LSTM) and Bayesian vector autoregression (BVAR). To facilitate further application and testing of each of the examined methodologies, an open-source repository containing boilerplate code that can be applied to different datasets is published alongside the paper, available at: github.com/dhopp1/nowcasting_benchmark.
    Perseus: A Simple High-Order Regularization Method for Variational Inequalities. (arXiv:2205.03202v1 [math.OC])
    This paper settles an open and challenging question pertaining to the design of simple high-order regularization methods for solving smooth and monotone variational inequalities (VIs). A VI involves finding $x^\star \in \mathcal{X}$ such that $\langle F(x), x - x^\star\rangle \geq 0$ for all $x \in \mathcal{X}$, and we consider the setting where $F: \mathbb{R}^d \mapsto \mathbb{R}^d$ is smooth with up to $(p-1)^{th}$-order derivatives. For the case of $p = 2$, Nesterov (2006) extended the cubic regularized Newton's method to VIs with a global rate of $O(\epsilon^{-1})$. Monteiro and Svaiter (2012) proposed another second-order method which achieved an improved rate of $O(\epsilon^{-2/3}\log(1/\epsilon))$, but this method required a nontrivial binary search procedure as an inner loop. High-order methods based on similar binary search procedures have been further developed and shown to achieve a rate of $O(\epsilon^{-2/(p+1)}\log(1/\epsilon))$. However, such search procedures can be computationally prohibitive in practice, and the problem of finding simple high-order regularization methods has remained an open and challenging question in optimization theory. We propose a $p^{th}$-order method which does \textit{not} require any binary search scheme and is guaranteed to converge to a weak solution with a global rate of $O(\epsilon^{-2/(p+1)})$. A version with restarting attains a global linear and local superlinear convergence rate for smooth and strongly monotone VIs. Further, our method achieves a global rate of $O(\epsilon^{-2/p})$ for solving smooth and non-monotone VIs satisfying the Minty condition; moreover, the restarted version again attains a global linear and local superlinear convergence rate if the strong Minty condition holds.
    Out-of-Distribution Detection for Medical Applications: Guidelines for Practical Evaluation. (arXiv:2109.14885v2 [cs.LG] UPDATED)
    Detection of Out-of-Distribution (OOD) samples in real time is a crucial safety check for deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice. This gap impedes implementation of OOD detection methods for real-world applications. Here, we propose a series of practical considerations and tests to choose the best OOD detector for a specific medical dataset. These guidelines are illustrated on a real-life use case of Electronic Health Records (EHR). Our results can serve as a guide for implementation of OOD detection methods in clinical practice, mitigating risks associated with the use of machine learning models in healthcare.
    Tensor Principal Component Analysis in High Dimensional CP Models. (arXiv:2108.04428v3 [stat.ML] UPDATED)
    The CP decomposition for high dimensional non-orthogonal spiked tensors is an important problem with broad applications across many disciplines. However, previous works with theoretical guarantees typically assume restrictive incoherence conditions on the basis vectors for the CP components. In this paper, we propose new computationally efficient composite PCA and concurrent orthogonalization algorithms for tensor CP decomposition with theoretical guarantees under mild incoherence conditions. The composite PCA applies the principal component or singular value decomposition twice, first to a matrix unfolding of the tensor data to obtain singular vectors, and then to the matrix folding of the singular vectors obtained in the first step. It can be used as an initialization for any iterative optimization scheme for the tensor CP decomposition. The concurrent orthogonalization algorithm iteratively estimates the basis vector in each mode of the tensor by simultaneously applying projections to the orthogonal complements of the spaces generated by the other CP components in the other modes. It is designed to improve the alternating least squares estimator and other forms of the high order orthogonal iteration for tensors with low or moderately high CP ranks, and it is guaranteed to converge rapidly when the error of any given initial estimator is bounded by a small constant. Our theoretical investigation provides estimation accuracy and convergence rates for the two proposed algorithms. Both proposed algorithms are applicable to a deterministic tensor, its noisy version, and the order-$2K$ covariance tensor of order-$K$ tensor data in a factor model with uncorrelated factors. Our implementations on synthetic data demonstrate significant practical superiority of our approach over existing methods.
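    The two-stage structure of composite PCA (SVD of an unfolding, then SVD of the folded singular vectors) is easy to sketch for a 3-way tensor. The choice of unfolding and the handling of ranks below are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def composite_pca(T, r):
    """Rank-r CP initialisation for a d1 x d2 x d3 tensor via two SVDs."""
    d1, d2, d3 = T.shape
    # Stage 1: SVD of the (d1*d2) x d3 unfolding.
    U, _, Vt = np.linalg.svd(T.reshape(d1 * d2, d3), full_matrices=False)
    A, B = [], []
    for j in range(r):
        # Stage 2: fold the j-th left singular vector into a d1 x d2
        # matrix and SVD it to separate the mode-1 and mode-2 directions.
        u1, _, v1t = np.linalg.svd(U[:, j].reshape(d1, d2))
        A.append(u1[:, 0])
        B.append(v1t[0])
    return np.stack(A, 1), np.stack(B, 1), Vt[:r].T  # mode-1/2/3 estimates
```

    The returned factor matrices can then seed alternating least squares or the concurrent orthogonalization iterations described in the abstract.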
    Transformer Embeddings of Irregularly Spaced Events and Their Participants. (arXiv:2201.00044v3 [cs.LG] UPDATED)
    The neural Hawkes process (Mei & Eisner, 2017) is a generative model of irregularly spaced sequences of discrete events. To handle complex domains with many event types, Mei et al. (2020a) further consider a setting in which each event in the sequence updates a deductive database of facts (via domain-specific pattern-matching rules); future events are then conditioned on the database contents. They show how to convert such a symbolic system into a neuro-symbolic continuous-time generative model, in which each database fact and each possible event has a time-varying embedding that is derived from its symbolic provenance. In this paper, we modify both models, replacing their recurrent LSTM-based architectures with flatter attention-based architectures (Vaswani et al., 2017), which are simpler and more parallelizable. This does not appear to hurt our accuracy, which is comparable to or better than that of the original models as well as (where applicable) previous attention-based methods (Zuo et al., 2020; Zhang et al., 2020a).
    Ultra-sensitive Flexible Sponge-Sensor Array for Muscle Activities Detection and Human Limb Motion Recognition. (arXiv:2205.03238v1 [eess.SP])
    Human limb motion tracking and recognition plays an important role in medical rehabilitation training, lower limb assistance, prosthetics design for amputees, feedback control for assistive robots, etc. Lightweight wearable sensors, including inertial sensors, surface electromyography sensors, and flexible strain/pressure sensors, are promising candidates to become the next-generation human motion capture devices. Herein, we present a wireless wearable device consisting of a sixteen-channel flexible sponge-based pressure sensor array to recognize various human lower limb motions by detecting contours on the human skin caused by calf gastrocnemius muscle actions. Each sensing element is a round porous structure of thin carbon nanotube/polydimethylsiloxane nanocomposites with a diameter of 4 mm and a thickness of about 400 μm. Three human subjects were recruited to perform ten different lower limb motions while wearing the developed device. The motion classification result with the support vector machine method shows a macro-recall of about 94.48% for all ten motions tested. This work demonstrates a portable wearable muscle activity detection device with a lower limb motion recognition application, which can potentially be used in assistive robot control, healthcare, sports monitoring, etc.
    R-GCN: The R Could Stand for Random. (arXiv:2203.02424v2 [cs.LG] UPDATED)
    The inception of the Relational Graph Convolutional Network (R-GCN) marked a milestone in the Semantic Web domain as a widely cited method that generalises end-to-end hierarchical representation learning to Knowledge Graphs (KGs). R-GCNs generate representations for nodes of interest by repeatedly aggregating parameterised, relation-specific transformations of their neighbours. However, in this paper, we argue that the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned weights. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which leaves all parameters untrained and thus constructs node embeddings by aggregating randomly transformed random representations from neighbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings.
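    A single untrained layer of this kind is short enough to write out. The sketch below assumes mean aggregation per relation and frozen Gaussian projections; the RR-GCN's exact aggregation and normalisation may differ.

```python
import numpy as np

def rr_gcn_layer(H, adj_per_relation, rng, out_dim):
    """One random relational message-passing step.

    H: (n, d) current node representations.
    adj_per_relation: list of (n, n) adjacency matrices, one per relation.
    """
    n, d = H.shape
    out = np.zeros((n, out_dim))
    for A in adj_per_relation:
        # Frozen, relation-specific random projection (never trained).
        W = rng.standard_normal((d, out_dim)) / np.sqrt(d)
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
        out += (A @ H / deg) @ W  # mean-aggregate neighbours, then project
    return np.tanh(out)
```

    Stacking a few such layers on top of random initial node features yields embeddings that a simple downstream classifier can consume, which is roughly the setting the paper evaluates.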
    MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients. (arXiv:2203.08669v2 [cs.CR] UPDATED)
    Existing model poisoning attacks on federated learning assume that an attacker has access to a large fraction of compromised genuine clients. However, such an assumption is not realistic in production federated learning systems that involve millions of clients. In this work, we propose the first Model Poisoning Attack based on Fake clients, called MPAF. Specifically, we assume the attacker injects fake clients into a federated learning system and sends carefully crafted fake local model updates to the cloud server during training, such that the learnt global model has low accuracy for many indiscriminate test inputs. Towards this goal, our attack drags the global model towards an attacker-chosen base model that has low accuracy. Specifically, in each round of federated learning, the fake clients craft fake local model updates that point to the base model and scale them up to amplify their impact before sending them to the cloud server. Our experiments show that MPAF can significantly decrease the test accuracy of the global model, even if classical defenses and norm clipping are adopted, highlighting the need for more advanced defenses.
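    The fake update itself is essentially a one-liner: a scaled difference between the attacker's base model and the current global model. A sketch, where the scaling factor is a free attack parameter:

```python
import numpy as np

def mpaf_fake_update(global_w, base_w, scale=10.0):
    """Fake local update pointing from the current global model towards
    the attacker-chosen low-accuracy base model, scaled up to amplify
    its impact under averaging-based aggregation."""
    return scale * (base_w - global_w)
```

    Each round, every injected fake client reports this same direction, so after aggregation the global model drifts towards the base model even when the genuine clients' updates pull elsewhere.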
    Let's Go to the Alien Zoo: Introducing an Experimental Framework to Study Usability of Counterfactual Explanations for Machine Learning. (arXiv:2205.03398v1 [cs.HC])
    To foster usefulness and accountability of machine learning (ML), it is essential to explain a model's decisions in addition to evaluating its performance. Accordingly, the field of explainable artificial intelligence (XAI) has resurfaced as a topic of active research, offering approaches to address the "how" and "why" of automated decision-making. Within this domain, counterfactual explanations (CFEs) have gained considerable traction as a psychologically grounded approach to generate post-hoc explanations. To do so, CFEs highlight what changes to a model's input would have changed its prediction in a particular way. However, despite the introduction of numerous CFE approaches, their usability has yet to be thoroughly validated at the human level. Thus, to advance the field of XAI, we introduce the Alien Zoo, an engaging, web-based and game-inspired experimental framework. The Alien Zoo provides the means to evaluate usability of CFEs for gaining new knowledge from an automated system, targeting novice users in a domain-general context. As a proof of concept, we demonstrate the practical efficacy and feasibility of this approach in a user study. Our results suggest that users benefit from receiving CFEs compared to no explanation, both in terms of objective performance in the proposed iterative learning task, and subjective usability. With this work, we aim to equip research groups and practitioners with the means to easily run controlled and well-powered user studies to complement their otherwise often more technology-oriented work. Thus, in the interest of reproducible research, we provide the entire code, together with the underlying models and user data.
    Vehicle management in a modular production context using Deep Q-Learning. (arXiv:2205.03294v1 [cs.LG])
    We investigate the feasibility of deploying Deep-Q based deep reinforcement learning agents for job-shop scheduling problems in the context of modular production facilities, using discrete event simulations for the environment. These environments comprise a source and a sink for the parts to be processed, as well as (several) workstations. The agents are trained to schedule automated guided vehicles to transport the parts back and forth between those stations in an optimal fashion. Starting from a very simplistic setup, we increase the complexity of the environment and compare the agents' performance with well established heuristic approaches, such as first-in-first-out based agents, cost tables and a nearest-neighbor approach. We furthermore seek particular configurations of the environments in which the heuristic approaches struggle, to investigate to what degree the Deep-Q agents are affected by these challenges. We find that Deep-Q based agents show performance comparable to the heuristic baselines. Furthermore, our findings suggest that the DRL agents exhibit an increased robustness to noise compared to the conventional approaches. Overall, we find that DRL agents constitute a valuable approach for this type of scheduling problem.  ( 2 min )
    Fundamental Performance Limits for Sensor-Based Robot Control and Policy Learning. (arXiv:2202.00129v2 [cs.RO] UPDATED)
    Our goal is to develop theory and algorithms for establishing fundamental limits on performance for a given task imposed by a robot's sensors. In order to achieve this, we define a quantity that captures the amount of task-relevant information provided by a sensor. Using a novel version of the generalized Fano inequality from information theory, we demonstrate that this quantity provides an upper bound on the highest achievable expected reward for one-step decision making tasks. We then extend this bound to multi-step problems via a dynamic programming approach. We present algorithms for numerically computing the resulting bounds, and demonstrate our approach on three examples: (i) the lava problem from the literature on partially observable Markov decision processes, (ii) an example with continuous state and observation spaces corresponding to a robot catching a freely-falling object, and (iii) obstacle avoidance using a depth sensor with non-Gaussian noise. We demonstrate the ability of our approach to establish strong limits on achievable performance for these problems by comparing our upper bounds with achievable lower bounds (computed by synthesizing or learning concrete control policies).
    DADApy: Distance-based Analysis of DAta-manifolds in Python. (arXiv:2205.03373v1 [cs.LG])
    DADApy is a python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. The package is freely available under the open-source Apache 2.0 license and can be downloaded from the Github page https://github.com/sissa-data-science/DADApy.
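    A minimal session might look like the following. This assumes the documented `Data` interface of the package; method names and return values should be checked against the installed release.

```python
import numpy as np
from dadapy import Data

# Toy dataset: 1000 points in 20 ambient dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

d = Data(X)
print(d.compute_id_2NN())  # two-NN intrinsic-dimension estimate
```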
    Designing Robust Biotechnological Processes Regarding Variabilities using Multi-Objective Optimization Applied to a Biopharmaceutical Seed Train Design. (arXiv:2205.03261v1 [cs.LG])
    Development and optimization of biopharmaceutical production processes with cell cultures is cost- and time-consuming and often performed rather empirically. Efficient optimization of multiple objectives such as process time, viable cell density, number of operating steps & cultivation scales, required medium, amount of product, and product quality represents a promising approach. This contribution presents a workflow which couples uncertainty-based upstream simulation and Bayes optimization using Gaussian processes. Its application is demonstrated in a simulation case study for a relevant industrial task in process development, the design of a robust cell culture expansion process (seed train), meaning that despite uncertainties and variabilities concerning cell growth, low variations of viable cell density during the seed train are obtained. Compared to a non-optimized reference seed train, the optimized process showed much lower deviation rates regarding viable cell densities (below ~10% instead of 41.7%) using 5 or 4 shake flask scales, and the seed train duration could be reduced by 56 h from 576 h to 520 h. Overall, it is shown that applying Bayes optimization allows for the optimization of a multi-objective optimization function with several optimizable input variables and under a considerable number of constraints with low computational effort. This approach provides the potential to be used in the form of a decision tool, e.g. for the choice of an optimal and robust seed train design or for further optimization tasks within process development.  ( 2 min )
    PTFlash: A deep learning framework for isothermal two-phase equilibrium calculations. (arXiv:2205.03090v1 [physics.chem-ph])
    Phase equilibrium calculations are an essential part of numerical simulations of multi-component multi-phase flow in porous media, accounting for the largest share of the computational time. In this work, we introduce a GPU-enabled, fast, and parallel framework, PTFlash, that vectorizes algorithms required for isothermal two-phase flash calculations using PyTorch, and can facilitate a wide range of downstream applications. In addition, to further accelerate PTFlash, we design two task-specific neural networks, one for predicting the stability of given mixtures and the other for providing estimates of the distribution coefficients, which are trained offline and help shorten computation time by sidestepping stability analysis and reducing the number of iterations needed to reach convergence. The evaluation of PTFlash was conducted on three case studies involving hydrocarbons, CO$_2$ and N$_2$, for which the phase equilibrium was tested over a large range of temperature, pressure and composition conditions, using the Soave-Redlich-Kwong (SRK) equation of state. We compare PTFlash with an in-house thermodynamic library, Carnot, written in C++ and performing flash calculations one by one on CPU. Results show speed-ups on large scale calculations of up to two orders of magnitude, while maintaining perfect precision with respect to the reference solution provided by Carnot.  ( 2 min )
    On boundary conditions parametrized by analytic functions. (arXiv:2205.03185v1 [cs.LG])
    Computer algebra can answer various questions about partial differential equations using symbolic algorithms. However, the inclusion of data into equations is rare in computer algebra. Therefore, recently, computer algebra models have been combined with Gaussian processes, a regression model in machine learning, to describe the behavior of certain differential equations under data. While it was possible to describe polynomial boundary conditions in this context, we extend these models to analytic boundary conditions. Additionally, we describe the necessary algorithms for Gröbner and Janet bases of Weyl algebras with certain analytic coefficients. Using these algorithms, we provide examples of divergence-free flow in domains bounded by analytic functions and adapted to observations.  ( 2 min )
    Federated Learning with Noisy User Feedback. (arXiv:2205.03092v1 [cs.LG])
    Machine Learning (ML) systems are getting increasingly popular, and drive more and more applications and services in our daily life. This has led to growing concerns over user privacy, since human interaction data typically needs to be transmitted to the cloud in order to train and improve such systems. Federated learning (FL) has recently emerged as a method for training ML models on edge devices using sensitive user data and is seen as a way to mitigate concerns over data privacy. However, since ML models are most commonly trained with label supervision, we need a way to extract labels on edge to make FL viable. In this work, we propose a strategy for training FL models using positive and negative user feedback. We also design a novel framework to study different noise patterns in user feedback, and explore how well standard noise-robust objectives can help mitigate this noise when training models in a federated setting. We evaluate our proposed training setup through detailed experiments on two text classification datasets and analyze the effects of varying levels of user reliability and feedback noise on model performance. We show that our method improves substantially over a self-training baseline, achieving performance closer to models trained with full supervision.  ( 2 min )
    Investigating and Explaining the Frequency Bias in Image Classification. (arXiv:2205.03154v1 [cs.CV])
    CNNs exhibit many behaviors different from humans, one of which is the capability of employing high-frequency components. This paper discusses the frequency bias phenomenon in image classification tasks: the high-frequency components are actually much less exploited than the low- and mid-frequency components. We first investigate the frequency bias phenomenon by presenting two observations on feature discrimination and learning priority. Furthermore, we hypothesize that (i) the spectral density and (ii) the class consistency directly affect the frequency bias. Specifically, our investigations verify that the spectral density of datasets mainly affects the learning priority, while the class consistency mainly affects the feature discrimination.  ( 2 min )
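    Analyses of this kind typically start by splitting each image into low- and high-frequency components with a radial mask in Fourier space. A generic sketch (the cutoff radius is an arbitrary illustrative choice, not the paper's setting):

```python
import numpy as np

def frequency_split(img, cutoff=0.25):
    """Split a 2D grayscale image into low- and high-frequency parts
    using a circular low-pass mask in the centred FFT domain."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    low = np.fft.ifft2(np.fft.ifftshift(F * (r <= cutoff))).real
    return low, img - low  # low-pass image and its high-frequency residual
```

    Training or evaluating a classifier on each component separately then reveals how much each frequency band contributes to the learned features.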
    TTRS: Tinkoff Transactions Recommender System benchmark. (arXiv:2110.05589v2 [cs.LG] UPDATED)
    Over the past decade, tremendous progress has been made in inventing new RecSys methods. However, one of the fundamental problems of the RecSys research community remains the lack of applied datasets and benchmarks with well-defined evaluation rules and metrics to test these novel approaches. In this article, we present the TTRS - Tinkoff Transactions Recommender System benchmark. This financial transaction benchmark contains over 2 million interactions between almost 10,000 users and more than 1,000 merchant brands over 14 months. To the best of our knowledge, this is the first publicly available financial transactions dataset. To make it more suitable for possible applications, we provide a complete description of the data collection pipeline, its preprocessing, and the resulting dataset statistics. We also present a comprehensive comparison of the current popular RecSys methods on the next-period recommendation task and conduct a detailed analysis of their performance against various metrics and datasets.  ( 2 min )
    Building a 3-Player Mahjong AI using Deep Reinforcement Learning. (arXiv:2202.12847v2 [cs.AI] UPDATED)
    Mahjong is a popular multi-player imperfect-information game developed in China in the late 19th-century, with some very challenging features for AI research. Sanma, being a 3-player variant of the Japanese Riichi Mahjong, possesses unique characteristics including fewer tiles and, consequently, a more aggressive playing style. It is thus challenging and of great research interest in its own right, but has not yet been explored. In this paper, we present Meowjong, an AI for Sanma using deep reinforcement learning. We define an informative and compact 2-dimensional data structure for encoding the observable information in a Sanma game. We pre-train 5 convolutional neural networks (CNNs) for Sanma's 5 actions -- discard, Pon, Kan, Kita and Riichi, and enhance the major action's model, namely the discard model, via self-play reinforcement learning using the Monte Carlo policy gradient method. Meowjong's models achieve test accuracies comparable with AIs for 4-player Mahjong through supervised learning, and gain a significant further enhancement from reinforcement learning. Being the first ever AI in Sanma, we claim that Meowjong stands as a state-of-the-art in this game.  ( 2 min )
    Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms. (arXiv:2205.03207v1 [cs.LG])
    While the field of Quality-Diversity (QD) has grown into a distinct branch of stochastic optimization, a few problems, in particular locomotion and navigation tasks, have become de facto standards. Are such benchmarks sufficient? Are they representative of the key challenges faced by QD algorithms? Do they provide the ability to focus on one particular challenge by properly disentangling it from others? Do they have much predictive power in terms of scalability and generalization? Existing benchmarks are not standardized, and there is currently no MNIST equivalent for QD. Inspired by recent works on Reinforcement Learning benchmarks, we argue that the identification of challenges faced by QD methods and the development of targeted, challenging, scalable but affordable benchmarks is an important step. As an initial effort, we identify three problems that are challenging in sparse reward settings, and propose associated benchmarks: (1) Behavior metric bias, which can result from the use of metrics that do not match the structure of the behavior space. (2) Behavioral Plateaus, with varying characteristics, such that escaping them would require adaptive QD algorithms and (3) Evolvability Traps, where small variations in genotype result in large behavioral changes. The environments that we propose satisfy the properties listed above.  ( 2 min )
    Remote Blood Oxygen Estimation From Videos Using Neural Networks. (arXiv:2107.05087v2 [cs.LG] UPDATED)
    Blood oxygen saturation (SpO$_2$) is an essential indicator of respiratory functionality and is receiving increasing attention during the COVID-19 pandemic. Clinical findings show that it is possible for COVID-19 patients to have significantly low SpO$_2$ before any obvious symptoms. The prevalence of cameras has motivated researchers to investigate methods for monitoring SpO$_2$ using videos. Most prior schemes involving smartphones are contact-based: They require a fingertip to cover the phone's camera and the nearby light source to capture re-emitted light from the illuminated tissue. In this paper, we propose the first convolutional neural network based noncontact SpO$_2$ estimation scheme using smartphone cameras. The scheme analyzes the videos of a participant's hand for physiological sensing, which is convenient and comfortable, and can protect their privacy and allow for keeping face masks on. We design our neural network architectures inspired by the optophysiological models for SpO$_2$ measurement and demonstrate the explainability by visualizing the weights for channel combination. Our proposed models outperform the state-of-the-art model that is designed for contact-based SpO$_2$ measurement, showing the potential of our proposed method to contribute to public health. We also analyze the impact of skin type and the side of a hand on SpO$_2$ estimation performance.  ( 2 min )
    TAGLETS: A System for Automatic Semi-Supervised Learning with Auxiliary Data. (arXiv:2111.04798v3 [cs.LG] UPDATED)
    Machine learning practitioners often have access to a spectrum of data: labeled data for the target task (which is often limited), unlabeled data, and auxiliary data, the many available labeled datasets for other tasks. We describe TAGLETS, a system built to study techniques for automatically exploiting all three types of data and creating high-quality, servable classifiers. The key components of TAGLETS are: (1) auxiliary data organized according to a knowledge graph, (2) modules encapsulating different methods for exploiting auxiliary and unlabeled data, and (3) a distillation stage in which the ensembled modules are combined into a servable model. We compare TAGLETS with state-of-the-art transfer learning and semi-supervised learning methods on four image classification tasks. Our study covers a range of settings, varying the amount of labeled data and the semantic relatedness of the auxiliary data to the target task. We find that the intelligent incorporation of auxiliary and unlabeled data into multiple learning techniques enables TAGLETS to match, and most often significantly surpass, these alternatives. TAGLETS is available as an open-source system at github.com/BatsResearch/taglets.  ( 2 min )
    Optimal quantum dataset for learning a unitary transformation. (arXiv:2203.00546v2 [quant-ph] UPDATED)
    Unitary transformations formulate the time evolution of quantum states. How to learn a unitary transformation efficiently is a fundamental problem in quantum machine learning. The most natural and leading strategy is to train a quantum machine learning model based on a quantum dataset. Although the presence of more training data results in better models, using too much data reduces the efficiency of training. In this work, we solve the problem of the minimum size of a quantum dataset sufficient for learning a unitary transformation exactly, which reveals the power and limitation of quantum data. First, we prove that the minimum size of a dataset with pure states is $2^n$ for learning an $n$-qubit unitary transformation. To fully explore the capability of quantum data, we introduce a practical quantum dataset consisting of $n+1$ elementary tensor product states that are sufficient for exact training. The main idea is to simplify the structure utilizing decoupling, which leads to an exponential improvement in size over datasets with pure states. Furthermore, we show that the size of a quantum dataset with mixed states can be reduced to a constant, which yields an optimal quantum dataset for learning a unitary. We showcase the applications of our results in oracle compiling and Hamiltonian simulation. Notably, to accurately simulate a 3-qubit one-dimensional nearest-neighbor Heisenberg model, our circuit uses only $48$ elementary quantum gates, which is significantly fewer than the $4320$ gates in the circuit constructed by the Trotter-Suzuki product formula.  ( 2 min )
    LPGNet: Link Private Graph Networks for Node Classification. (arXiv:2205.03105v1 [cs.LG])
    Classification tasks on labeled graph-structured data have many important applications ranging from social recommendation to financial modeling. Deep neural networks are increasingly being used for node classification on graphs, wherein nodes with similar features have to be given the same label. Graph convolutional networks (GCNs) are one such widely studied neural network architecture that perform well on this task. However, powerful link-stealing attacks on GCNs have recently shown that even with black-box access to the trained model, inferring which links (or edges) are present in the training graph is practical. In this paper, we present a new neural network architecture called LPGNet for training on graphs with privacy-sensitive edges. LPGNet provides differential privacy (DP) guarantees for edges using a novel design for how graph edge structure is used during training. We empirically show that LPGNet models often lie in the sweet spot between providing privacy and utility: They can offer better utility than "trivially" private architectures which use no edge information (e.g., vanilla MLPs) and better resilience against existing link-stealing attacks than vanilla GCNs which use the full edge structure. LPGNet also offers consistently better privacy-utility tradeoffs than DPGCN, which is the state-of-the-art mechanism for retrofitting differential privacy into conventional GCNs, in most of our evaluated datasets.  ( 2 min )
    Controlled Dropout for Uncertainty Estimation. (arXiv:2205.03109v1 [cs.LG])
    Uncertainty quantification in a neural network is one of the most discussed topics for safety-critical applications. Though Neural Networks (NNs) have achieved state-of-the-art performance for many applications, they still provide unreliable point predictions, which lack information about uncertainty estimates. Among the various methods that enable neural networks to estimate uncertainty, Monte Carlo (MC) dropout has gained much popularity in a short period due to its simplicity. In this study, we present a new version of the traditional dropout layer in which the number of dropout configurations is fixed. Each layer can then apply this new dropout layer in the MC method to quantify the uncertainty associated with NN predictions. We conduct experiments on both toy and realistic datasets and compare the results with the MC method using the traditional dropout layer. Performance analysis utilizing uncertainty evaluation metrics corroborates that our dropout layer offers better performance in most cases.  ( 2 min )
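    The idea of a fixed pool of dropout masks can be sketched in a few lines. Pre-drawing K masks and cycling through them at inference is an assumed reading of "fixed configurations"; the paper's exact construction may differ.

```python
import torch

class ControlledDropout(torch.nn.Module):
    """Dropout with a fixed set of K pre-drawn masks ("configurations")
    instead of a fresh random mask on every forward pass."""
    def __init__(self, dim, p=0.5, n_configs=10, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        masks = (torch.rand(n_configs, dim, generator=g) > p).float() / (1 - p)
        self.register_buffer("masks", masks)

    def forward(self, x, config):
        return x * self.masks[config]

# At test time, average predictions over the K fixed configurations:
#   preds = torch.stack([model(x, config=k) for k in range(K)])
#   mean, var = preds.mean(0), preds.var(0)
```

    Fixing the configurations makes the MC predictive distribution reproducible across runs, which is convenient when comparing uncertainty evaluation metrics.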
    Newton-MR: Inexact Newton Method With Minimum Residual Sub-problem Solver. (arXiv:1810.00303v4 [math.OC] UPDATED)
    We consider a variant of the inexact Newton method, called Newton-MR, in which the least-squares sub-problems are solved approximately using the Minimum Residual (MINRES) method. By construction, Newton-MR can readily be applied to the unconstrained optimization of a class of non-convex problems known as invex, which subsumes convexity as a sub-class. For invex optimization, instead of the classical Lipschitz continuity assumptions on gradient and Hessian, Newton-MR's global convergence can be guaranteed under a weaker notion of joint regularity of Hessian and gradient. We also obtain Newton-MR's problem-independent local convergence to the set of minima. We show that fast local/global convergence can be guaranteed under a novel inexactness condition, which, to our knowledge, is much weaker than those in prior related works. Numerical results demonstrate the performance of Newton-MR as compared with several other Newton-type alternatives on a few machine learning problems.  ( 2 min )
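    The basic iteration is easy to prototype with SciPy, whose MINRES solver handles the symmetric (possibly indefinite) Newton system. A bare-bones sketch without the line search and inexactness control the full method uses:

```python
import numpy as np
from scipy.sparse.linalg import minres

def newton_mr_step(x, grad, hess):
    """One inexact Newton step: solve H p = -g with MINRES, then update."""
    g, H = grad(x), hess(x)
    p, _ = minres(H, -g)
    return x + p

# Example on a convex quadratic f(x) = 0.5 x'Ax - b'x, whose Newton step
# lands on the exact minimiser in one iteration:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = newton_mr_step(np.zeros(2), lambda x: A @ x - b, lambda x: A)
```

    MINRES minimises the residual norm of the linear system, which is what lets this approach tolerate singular or indefinite Hessians where conjugate gradient (the usual Newton-CG inner solver) can break down.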
    How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation. (arXiv:2205.03353v1 [cs.RO])
    Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings, such as robotics, where such interaction is expensive. In this work we investigate ways to minimize online interactions in a target task by reusing a suboptimal policy we might have access to, for example from training on related prior tasks or in simulation. To this end, we develop two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand. We conduct a thorough experimental study of how to use suboptimal teachers on a challenging robotic manipulation benchmark on vision-based stacking with diverse objects. We compare our methods to offline, online, offline-to-online, and kickstarting RL algorithms. By doing so, we find that training on data from both the teacher and the student enables the best performance for limited data budgets. We examine how to best allocate a limited data budget -- on the target task -- between the teacher and the student policy, and report experiments using varying budgets, two teachers with different degrees of suboptimality, and five stacking tasks that require a diverse set of behaviors. Our analysis, both in simulation and in the real world, shows that our approach is the best across data budgets, while standard offline RL from teacher rollouts is surprisingly effective when enough data is given.  ( 2 min )
    HumanAL: Calibrating Human Matching Beyond a Single Task. (arXiv:2205.03209v1 [cs.DB])
    This work offers a novel view on the use of human input as labels, acknowledging that humans may err. We build a behavioral profile for human annotators, which is used as a feature representation of the provided input. We show that by utilizing black-box machine learning, we can take annotator behavior into account and calibrate their input to improve labeling quality. To support our claims and provide a proof-of-concept, we experiment with three different matching tasks, namely schema matching, entity matching, and text matching. Our empirical evaluation suggests that the method can improve the quality of gathered labels in multiple settings, including cross-domain (across different matching tasks).  ( 2 min )
    Beyond backpropagation: implicit gradients for bilevel optimization. (arXiv:2205.03076v1 [cs.LG])
    This paper reviews gradient-based techniques for solving bilevel optimization problems. Bilevel optimization is a general way to frame the learning of systems that are implicitly defined through a quantity they minimize. This characterization can be applied to neural networks, optimizers, algorithmic solvers, and even physical systems, and allows for greater modeling flexibility compared to an explicit definition of such systems. Here we focus on gradient-based approaches that solve such problems. We group them into two categories: those rooted in implicit differentiation, and those that leverage the equilibrium propagation theorem. We present the mathematical foundations behind such methods, introduce the gradient-estimation algorithms in detail, and compare the competitive advantages of the different approaches.  ( 2 min )
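    To make the implicit-differentiation route concrete, here is a hedged toy example: for an inner problem w*(theta) = argmin_w g(w, theta), the implicit function theorem gives dw*/dtheta = -H^{-1} B with H = d^2g/dw^2 and B = d^2g/(dw dtheta), so the hypergradient of an outer loss f(w*(theta)) follows by the chain rule. A quadratic inner problem lets us check the answer directly:

        import numpy as np

        # Inner problem: g(w, theta) = 0.5 * ||w - A theta||^2  =>  w*(theta) = A theta
        A = np.array([[2.0, 0.0], [0.0, 3.0]])
        theta = np.array([1.0, -1.0])
        w_star = A @ theta

        # Outer loss: f(w) = 0.5 * ||w||^2  =>  df/dw = w
        df_dw = w_star
        H = np.eye(2)     # d^2 g / dw^2 at w*
        B = -A            # d^2 g / (dw dtheta)
        hypergrad = -df_dw @ np.linalg.solve(H, B)   # df/dw @ dw*/dtheta
        print(hypergrad)  # matches the direct chain-rule answer A^T A theta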
    Fast Rate Generalization Error Bounds: Variations on a Theme. (arXiv:2205.03131v1 [cs.IT])
    A recent line of works, initiated by \cite{russo2016controlling} and \cite{xu2017information}, has shown that the generalization error of a learning algorithm can be upper bounded by information measures. In most of the relevant works, the convergence rate of the expected generalization error is of the form $O(\sqrt{\lambda/n})$, where $\lambda$ is some information-theoretic quantity such as the mutual information between the data sample and the learned hypothesis. However, such a rate is typically considered "slow" compared to the "fast rate" of $O(1/n)$ achievable in many learning scenarios. In this work, we first show that the square root does not necessarily imply a slow rate, and a fast-rate ($O(1/n)$) result can still be obtained from this bound under appropriate assumptions. Furthermore, we identify the key conditions needed for fast-rate generalization error, which we call the $(\eta,c)$-central condition. Under this condition, we give information-theoretic bounds on the generalization error and excess risk, with a convergence rate of $O(\lambda/n)$ for specific learning algorithms such as empirical risk minimization. Finally, analytical examples are given to show the effectiveness of the bounds.  ( 2 min )
    Characterizing TMS-EEG perturbation indexes using signal energy: initial study on Alzheimer's Disease classification. (arXiv:2205.03241v1 [eess.SP])
    Transcranial Magnetic Stimulation (TMS) combined with EEG recordings (TMS-EEG) has shown great potential in the study of the brain, and in particular of Alzheimer's Disease (AD). In this study, we propose an automatic method for determining the duration of the TMS-induced perturbation of the EEG signal as a potential metric reflecting the brain's functional alterations. A preliminary study is conducted in patients with AD. Three metrics for characterizing the strength and duration of TMS-evoked EEG potential (TEP) activity are proposed, and their potential for identifying AD patients from healthy controls is investigated. A dataset of TMS-EEG recordings from 17 AD patients and 17 healthy controls (HC) is used in our analysis. A Random Forest classification algorithm is trained on the extracted TEP metrics, and its performance is evaluated with leave-one-subject-out cross-validation. The resulting model shows promising results in identifying AD patients from HC, with an accuracy, sensitivity, and specificity of 69.32%, 72.23%, and 66.41%, respectively.  ( 2 min )
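    A minimal sketch of the evaluation protocol described above, assuming scikit-learn; the arrays are placeholders standing in for the three TEP metrics and the AD/HC labels:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

        rng = np.random.default_rng(0)
        X = rng.normal(size=(34, 3))      # 34 subjects x 3 TEP metrics (placeholder)
        y = np.repeat([0, 1], 17)         # 17 HC, 17 AD
        groups = np.arange(34)            # one recording per subject here

        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        pred = cross_val_predict(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
        print("LOSO accuracy:", accuracy_score(y, pred))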
    Detection of Propaganda Techniques in Visuo-Lingual Metaphor in Memes. (arXiv:2205.02937v1 [cs.CV])
    The exponential rise of social media networks has allowed the production, distribution, and consumption of data at a phenomenal rate. Moreover, the social media revolution has brought a unique phenomenon to these platforms: Internet memes. Internet memes are among the most popular forms of content on social media, typically images with a witty, catchy, or satirical text description. In this paper, we deal with the propaganda often seen in Internet memes in recent times. Propaganda is communication that frequently includes psychological and rhetorical techniques to manipulate or influence an audience to act or respond as the propagandist wants. To detect propaganda in Internet memes, we propose a multimodal deep learning fusion system that fuses text and image feature representations and outperforms individual models based solely on either text or image modalities.  ( 2 min )
    Emp-RFT: Empathetic Response Generation via Recognizing Feature Transitions between Utterances. (arXiv:2205.03112v1 [cs.CL])
    Each utterance in a multi-turn empathetic dialogue has features such as emotion, keywords, and utterance-level meaning, and feature transitions between utterances occur naturally. However, existing approaches fail to perceive these transitions because they extract features for the context at a coarse-grained level. To solve this issue, we propose a novel approach that recognizes feature transitions between utterances, which helps the model understand the dialogue flow and better grasp the features of the utterances that need attention. We also introduce a response generation strategy that helps the model focus on the emotion and keywords related to appropriate features when generating responses. Experimental results show that our approach outperforms baselines and, in particular, achieves significant improvements on multi-turn dialogues.  ( 2 min )
    Scalable computation of prediction intervals for neural networks via matrix sketching. (arXiv:2205.03194v1 [stat.ML])
    Accounting for the uncertainty in the predictions of modern neural networks is a challenging and important task in many domains. Existing algorithms for uncertainty estimation either require modifying the model architecture and training procedure (e.g., Bayesian neural networks) or dramatically increase the computational cost of predictions, as in ensembling-based approaches. This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals. The method is based on the classical delta method in statistics but achieves computational efficiency by using matrix sketching to approximate the Jacobian matrix. The resulting algorithm is competitive with state-of-the-art approaches for constructing predictive intervals on various regression datasets from the UCI repository.  ( 2 min )
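    A hedged numerical sketch of the idea: the classical delta method estimates the predictive variance as sigma^2 g^T (J^T J + lambda I)^{-1} g, and a random sketch of the Jacobian J keeps this tractable. The Gaussian sketch below is a generic stand-in; the abstract does not state which sketching scheme the paper uses.

        import numpy as np

        rng = np.random.default_rng(0)
        n, p, k = 500, 50, 200           # train points, weights, sketch size
        J = rng.normal(size=(n, p))      # Jacobian of train predictions w.r.t. weights
        g = rng.normal(size=p)           # gradient of the test prediction
        sigma2, lam = 0.25, 1e-2         # noise variance, damping

        # Exact delta-method variance: sigma^2 * g^T (J^T J + lam I)^{-1} g
        exact = sigma2 * g @ np.linalg.solve(J.T @ J + lam * np.eye(p), g)

        # Sketched version: replace J by S @ J with a k x n Gaussian sketch S.
        S = rng.normal(size=(k, n)) / np.sqrt(k)
        SJ = S @ J
        approx = sigma2 * g @ np.linalg.solve(SJ.T @ SJ + lam * np.eye(p), g)
        print(exact, approx)             # interval half-width: z * sqrt(variance)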
    Belief propagation for permutations, rankings, and partial orders. (arXiv:2110.00513v2 [cs.AI] UPDATED)
    Many datasets give partial information about an ordering or ranking by indicating which team won a game, which item a user prefers, or who infected whom. We define a continuous spin system whose Gibbs distribution is the posterior distribution on permutations, given a probabilistic model of these interactions. Using the cavity method we derive a belief propagation algorithm that computes the marginal distribution of each node's position. In addition, the Bethe free energy lets us approximate the number of linear extensions of a partial order and perform model selection between competing probabilistic models, such as the Bradley-Terry-Luce model of noisy comparisons and its cousins.  ( 2 min )
    Learning Optimal Propagation for Graph Neural Networks. (arXiv:2205.02998v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved tremendous success in a variety of real-world applications by relying on fixed graph data as input. However, the initial input graph might not be optimal for specific downstream tasks because of information scarcity, noise, adversarial attacks, or discrepancies between the distributions of the graph topology, features, and ground-truth labels. In this paper, we propose a bi-level optimization approach that learns the optimal graph structure by directly learning the Personalized PageRank propagation matrix jointly with the downstream semi-supervised node classification task. We also explore a low-rank approximation model to further reduce the time complexity. Empirical evaluations show the superior efficacy and robustness of the proposed model over all baseline methods.  ( 2 min )
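    For reference, the fixed (non-learned) Personalized PageRank propagation that such approaches generalize has the closed form Z = alpha (I - (1 - alpha) A_hat)^{-1} X, as in APPNP; a minimal NumPy sketch (the paper learns the propagation matrix rather than fixing alpha):

        import numpy as np

        def ppr_propagate(adj: np.ndarray, X: np.ndarray, alpha: float = 0.1):
            A = adj + np.eye(adj.shape[0])         # add self-loops
            d = A.sum(axis=1)
            A_hat = A / np.sqrt(np.outer(d, d))    # D^{-1/2} A D^{-1/2}
            n = A.shape[0]
            return alpha * np.linalg.solve(np.eye(n) - (1.0 - alpha) * A_hat, X)

        adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
        X = np.eye(3)                              # one-hot node features
        print(ppr_propagate(adj, X))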
    Explaining the Effectiveness of Multi-Task Learning for Efficient Knowledge Extraction from Spine MRI Reports. (arXiv:2205.02979v1 [cs.LG])
    Pretrained Transformer-based models finetuned on domain-specific corpora have changed the landscape of NLP. However, training or fine-tuning these models for individual tasks can be time consuming and resource intensive. Thus, much current research focuses on using transformers for multi-task learning (Raffel et al., 2020) and on how to group tasks so that a multi-task model learns effective representations that can be shared across tasks (Standley et al., 2020; Fifty et al., 2021). In this work, we show that a single multi-tasking model can match the performance of task-specific models when the task-specific models show similar representations across all of their hidden layers and their gradients are aligned, i.e., their gradients follow the same direction. We hypothesize that these observations explain the effectiveness of multi-task learning. We validate our observations on our internal radiologist-annotated datasets on the cervical and lumbar spine. Our method is simple and intuitive, and can be used in a wide range of NLP problems.  ( 2 min )
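    The gradient-alignment condition is easy to probe directly. A minimal PyTorch sketch with a toy shared encoder and two task heads (all shapes and losses are ours, for illustration): the cosine similarity between the per-task gradients of the shared parameters measures whether the tasks pull in the same direction.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        encoder = nn.Linear(8, 8)                          # toy shared encoder
        head_a, head_b = nn.Linear(8, 1), nn.Linear(8, 1)  # per-task heads
        x = torch.randn(16, 8)
        y_a, y_b = torch.randn(16, 1), torch.randn(16, 1)

        def encoder_grad(loss):
            grads = torch.autograd.grad(loss, encoder.parameters(), retain_graph=True)
            return torch.cat([g.flatten() for g in grads])

        z = encoder(x)
        g_a = encoder_grad(F.mse_loss(head_a(z), y_a))
        g_b = encoder_grad(F.mse_loss(head_b(z), y_b))
        print("alignment:", F.cosine_similarity(g_a, g_b, dim=0).item())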
    Design Target Achievement Index: A Differentiable Metric to Enhance Deep Generative Models in Multi-Objective Inverse Design. (arXiv:2205.03005v1 [cs.LG])
    Deep Generative Machine Learning Models have been growing in popularity across the design community thanks to their ability to learn and mimic complex data distributions. While early works are promising, further advancement will depend on addressing several critical considerations such as design quality, feasibility, novelty, and targeted inverse design. We propose the Design Target Achievement Index (DTAI), a differentiable, tunable metric that scores a design's ability to achieve designer-specified minimum performance targets. We demonstrate that DTAI can drastically improve the performance of generated designs when directly used as a training loss in Deep Generative Models. We apply the DTAI loss to a Performance-Augmented Diverse GAN (PaDGAN) and demonstrate superior generative performance compared to a set of baseline Deep Generative Models including a Multi-Objective PaDGAN and specialized tabular generation algorithms like the Conditional Tabular GAN (CTGAN). We further enhance PaDGAN with an auxiliary feasibility classifier to encourage feasible designs. To evaluate methods, we propose a comprehensive set of evaluation metrics for generative methods that focus on feasibility, diversity, and satisfaction of design performance targets. Methods are tested on a challenging benchmarking problem: the FRAMED bicycle frame design dataset featuring mixed-datatype parametric data, heavily skewed and multimodal distributions, and ten competing performance objectives.  ( 2 min )
    A CNN Approach for 5G mmWave Positioning Using Beamformed CSI Measurements. (arXiv:2205.03236v1 [eess.SP])
    The advent of Artificial Intelligence (AI) has impacted all aspects of human life. One concrete example of AI's impact is visible in radio positioning. In this article, for the first time, we utilize the power of AI by training a Convolutional Neural Network (CNN) using 5G New Radio (NR) fingerprints consisting of beamformed Channel State Information (CSI). By observing CSI, it is possible to characterize the multipath channel between the transmitter and the receiver, and thus provide a good source of spatiotemporal data for finding the position of a User Equipment (UE). We collect ray-tracing-based 5G NR CSI from an urban area. The CSI data of the signals from one Base Station (BS) is collected at reference points with known positions to train a CNN. We evaluate our work by testing: a) the robustness of the trained network in estimating positions for new measurements on the same reference points, and b) the accuracy of the CNN-based position estimation when the UE is at points other than the reference points. The results show that our trained network for a specific urban environment can estimate the UE position with a minimum mean error of 0.98 m.  ( 2 min )
    Dynamic Sparse Training for Deep Reinforcement Learning. (arXiv:2106.04217v3 [cs.LG] UPDATED)
    Deep reinforcement learning (DRL) agents are trained through trial-and-error interactions with the environment, so dense neural networks require long training times to achieve good performance and consume prohibitive computation and memory resources. Recently, learning efficient DRL agents has received increasing attention, yet current methods focus on accelerating inference time. In this paper, we introduce for the first time a dynamic sparse training approach for deep reinforcement learning that accelerates the training process. The proposed approach trains a sparse neural network from scratch and dynamically adapts its topology to the changing data distribution during training. Experiments on continuous control tasks show that our dynamic sparse agents achieve higher performance than equivalent dense methods, reduce the parameter count and floating-point operations (FLOPs) by 50%, and learn faster, reaching the performance of dense agents with a 40-50% reduction in training steps.  ( 2 min )
    Green Accelerated Hoeffding Tree. (arXiv:2205.03184v1 [cs.LG])
    State-of-the-art machine learning solutions mainly focus on creating highly accurate models without constraints on hardware resources. Stream mining algorithms, however, are designed to run on resource-constrained devices, so a focus on low power consumption and energy and memory efficiency is essential. The Hoeffding tree algorithm is able to create energy-efficient models, but at the cost of less accurate trees compared to their ensemble counterparts. Ensembles of Hoeffding trees, on the other hand, create a highly accurate forest of trees but consume five times more energy on average. The Extremely Fast Decision Tree (EFDT) is an extension that tried to obtain results similar to ensembles of Hoeffding trees. This paper presents the Green Accelerated Hoeffding Tree (GAHT) algorithm, an extension of EFDT with a lower energy and memory footprint and the same (or, for some datasets, higher) accuracy. GAHT grows the tree by setting individual splitting criteria for each node, based on the distribution of the number of instances over each particular leaf. The results show that GAHT achieves accuracy competitive with EFDT and ensembles of Hoeffding trees while reducing energy consumption by up to 70%.  ( 2 min )
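    For context, the bound that gives Hoeffding trees their name: a leaf splits once the observed gain gap between the two best attributes exceeds eps = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the gain measure, delta the allowed error probability, and n the number of instances seen at the leaf. A short sketch:

        import math

        def hoeffding_eps(value_range: float, delta: float, n: int) -> float:
            return math.sqrt(value_range**2 * math.log(1.0 / delta) / (2.0 * n))

        # Information gain lies in [0, log2(n_classes)], so R = 1 for two classes.
        for n in (50, 200, 1000):
            print(n, round(hoeffding_eps(1.0, 1e-7, n), 4))  # eps shrinks with n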
    A Deep Bayesian Bandits Approach for Anticancer Therapy: Exploration via Functional Prior. (arXiv:2205.02944v1 [cs.LG])
    Learning personalized cancer treatment with machine learning holds great promise for improving cancer patients' chances of survival. Despite recent advances in machine learning and precision oncology, this approach remains challenging, as collecting data in preclinical/clinical studies for modeling multiple treatment efficacies is often an expensive, time-consuming process. Moreover, randomization in treatment allocation proves to be suboptimal, since some participants/samples do not receive the most appropriate treatments during the trial. To address this challenge, we formulate the drug screening study as a "contextual bandit" problem, in which an algorithm selects anticancer therapeutics based on contextual information about cancer cell lines while adapting its treatment strategy to maximize treatment response in an "online" fashion. We propose a novel deep Bayesian bandits framework that uses a functional prior to approximate the posterior for drug response prediction based on multi-modal information consisting of genomic features and drug structure. We empirically evaluate our method on three large-scale in vitro pharmacogenomic datasets and show that our approach outperforms several benchmarks in identifying the optimal treatment for a given cell line.  ( 2 min )
    FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. (arXiv:2104.08815v3 [cs.CL] UPDATED)
    Increasing concerns and regulations about data privacy and sparsity necessitate the study of privacy-preserving, decentralized learning methods for natural language processing (NLP) tasks. Federated learning (FL) provides promising approaches for a large number of clients (e.g., personal devices or organizations) to collaboratively learn a shared global model that benefits all clients, while allowing users to keep their data locally. Despite interest in studying FL methods for NLP tasks, a systematic comparison and analysis is lacking in the literature. Herein, we present FedNLP, a benchmarking framework for evaluating federated learning methods on four different task formulations: text classification, sequence tagging, question answering, and seq2seq. We propose a universal interface between Transformer-based language models (e.g., BERT, BART) and FL methods (e.g., FedAvg, FedOPT) under various non-IID partitioning strategies. Our extensive experiments with FedNLP provide empirical comparisons between FL methods and help us better understand the inherent challenges of this direction. The comprehensive analysis points to intriguing and exciting future research aimed at developing FL methods for NLP tasks.  ( 2 min )
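    For reference, the FedAvg aggregation step that such benchmarks build on is a one-liner: the server takes an example-count-weighted average of the client models. A minimal PyTorch sketch (the state-dict format is assumed for illustration):

        import torch

        def fedavg(client_states, client_sizes):
            total = sum(client_sizes)
            return {k: sum(s[k] * (n / total)
                           for s, n in zip(client_states, client_sizes))
                    for k in client_states[0]}

        a = {"w": torch.ones(2, 2)}
        b = {"w": torch.zeros(2, 2)}
        print(fedavg([a, b], [30, 10])["w"])   # 0.75 everywhere: 30/40*1 + 10/40*0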
    Probabilistic learning constrained by realizations using a weak formulation of Fourier transform of probability measures. (arXiv:2205.03078v1 [stat.ML])
    This paper deals with taking a given set of realizations into account as constraints in the Kullback-Leibler minimum principle, which is used as a probabilistic learning algorithm. This permits the effective integration of data into predictive models. We consider the probabilistic learning of a random vector that is made up of either a quantity of interest (QoI; unsupervised case) or the couple of the QoI and a control parameter (supervised case). A training set of independent realizations of this random vector is assumed to be given and to have been generated with a prior probability measure that is unknown. A target set of realizations of the QoI is available in the two considered cases. The framework is that of non-Gaussian problems in high dimension. A functional approach is developed on the basis of a weak formulation of the Fourier transform of probability measures (characteristic functions). The construction makes it possible to take the target set of realizations of the QoI into account in the Kullback-Leibler minimum principle. The proposed approach allows for estimating the posterior probability measure of the QoI (unsupervised case) or the posterior joint probability measure of the QoI with the control parameter (supervised case). The existence and uniqueness of the posterior probability measure are analyzed in both cases. The numerical aspects are detailed in order to facilitate the implementation of the proposed method. The presented high-dimensional application demonstrates the efficiency and robustness of the proposed algorithm.  ( 2 min )
    Imperceptible Backdoor Attack: From Input Space to Feature Representation. (arXiv:2205.03190v1 [cs.CR])
    Backdoor attacks are rapidly emerging threats to deep neural networks (DNNs). In the backdoor attack scenario, attackers usually implant the backdoor into the target model by manipulating the training dataset or training process. The compromised model then behaves normally on benign input yet makes mistakes when the pre-defined trigger appears. In this paper, we analyze the drawbacks of existing attack approaches and propose a novel imperceptible backdoor attack. We treat the trigger pattern as a special kind of noise following a multinomial distribution. A U-Net-based network is employed to generate the concrete parameters of the multinomial distribution for each benign input. This elaborated trigger ensures that our approach is invisible to both humans and statistical detection. Beyond the design of the trigger, we also consider the robustness of our approach against model-diagnosis-based defences: we force the feature representation of malicious input stamped with the trigger to be entangled with that of benign input. We demonstrate the effectiveness and robustness against multiple state-of-the-art defences through extensive experiments on various datasets and networks. Our trigger modifies less than 1% of the pixels of a benign image, with a modification magnitude of 1. Our source code is available at https://github.com/Ekko-zn/IJCAI2022-Backdoor.  ( 2 min )
    The NT-Xent loss upper bound. (arXiv:2205.03169v1 [cs.LG])
    Self-supervised learning is a growing paradigm in deep representation learning, showing great generalization capabilities and competitive performance in low-labeled data regimes. The SimCLR framework proposes the NT-Xent loss for contrastive representation learning. The objective of the loss function is to maximize the agreement (similarity) between sampled positive pairs. This short paper derives and proposes an upper bound for the loss and the average similarity. An analysis of the implications is, however, not provided; we strongly encourage anyone in the field to conduct it.  ( 2 min )
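    For reference, a minimal PyTorch sketch of the NT-Xent loss under the usual SimCLR convention, where z1[i] and z2[i] are two augmented views of the same example, all other batch items serve as negatives, and tau is the temperature:

        import torch
        import torch.nn.functional as F

        def nt_xent(z1, z2, tau: float = 0.5):
            n = z1.shape[0]
            z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2n x d, unit norm
            sim = z @ z.t() / tau                         # scaled cosine sims
            sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool),
                                  float("-inf"))          # drop self-pairs
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
            return F.cross_entropy(sim, targets)          # positive = other view

        z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
        print(nt_xent(z1, z2).item())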
    UAV-aided RF Mapping for Sensing and Connectivity in Wireless Networks. (arXiv:2205.03335v1 [cs.IT])
    The use of unmanned aerial vehicles (UAVs) as flying radio access network (RAN) nodes offers a promising complement to traditional fixed terrestrial deployments. More recently, still in the context of wireless networks, drones have also been envisioned for use as radio frequency (RF) sensing and localization devices. In both cases, the advantage of using UAVs lies in their ability to navigate freely in 3D, and in a timely manner, to the locations where the obtained network throughput or sensing performance is optimal. In practice, the selection of a proper location or trajectory for the UAV depends very much on local terrain features, including the position of surrounding radio obstacles. Hence, the robot must be able to map the features of its radio environment as it performs its data communication or sensing services. The challenges related to this task, referred to here as radio mapping, are discussed in this paper. Its promise for efficient trajectory design for autonomous radio-aware UAVs is highlighted, along with algorithmic solutions. The advantages brought by radio mapping in terms of connectivity, sensing, and localization performance are illustrated.  ( 2 min )
    Immiscible Color Flows in Optimal Transport Networks for Image Classification. (arXiv:2205.02938v1 [cs.CV])
    In classification tasks, it is crucial to meaningfully exploit information contained in data. Here, we propose a physics-inspired dynamical system that adapts Optimal Transport principles to effectively leverage color distributions of images. Our dynamics regulates immiscible fluxes of colors traveling on a network built from images. Instead of aggregating colors together, it treats them as different commodities that interact with a shared capacity on edges. Our method outperforms competitor algorithms on image classification tasks in datasets where color information matters.  ( 2 min )
    Crop Type Identification for Smallholding Farms: Analyzing Spatial, Temporal and Spectral Resolutions in Satellite Imagery. (arXiv:2205.03104v1 [cs.CV])
    The integration of modern Machine Learning (ML) models into remote sensing and agriculture has expanded the scope of satellite imagery applications in the agriculture domain. In this paper, we show how the accuracy of crop type identification improves as we move from medium-spatiotemporal-resolution (MSTR) to high-spatiotemporal-resolution (HSTR) satellite images. We further demonstrate that high spectral resolution in satellite imagery can improve prediction performance for low-spatiotemporal-resolution (LSTR) images. The F1-score increases by 7% when using multispectral data from MSTR images compared to the best results obtained from HSTR images. Similarly, when a crop-season-based time series of multispectral data is used, we observe an increase of 1.2% in the F1-score. These outcomes motivate further advancements in the field of synthetic band generation.  ( 2 min )
    Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization. (arXiv:2205.03059v1 [cs.LG])
    Nonconvex regularization has been widely used in low-rank matrix learning. However, extending it to low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm avoids expensive tensor folding/unfolding operations. A special "sparse plus low-rank" structure is maintained throughout the iterations and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-{\L}ojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world datasets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state of the art.  ( 2 min )
    GANs as Gradient Flows that Converge. (arXiv:2205.02910v1 [cs.LG])
    This paper approaches the unsupervised learning problem by gradient descent in the space of probability density functions. Our main result shows that along the gradient flow induced by a distribution-dependent ordinary differential equation (ODE), the unknown data distribution emerges as the long-time limit of this flow of densities. That is, one can uncover the data distribution by simulating the distribution-dependent ODE. Intriguingly, we find that the simulation of the ODE is equivalent to the training of generative adversarial networks (GANs). The GAN framework, by definition a non-cooperative game between a generator and a discriminator, can therefore be viewed alternatively as a cooperative game between a navigator and a calibrator (in collaboration to simulate the ODE). At the theoretic level, this new perspective simplifies the analysis of GANs and gives new insight into their performance. To construct a solution to the distribution-dependent ODE, we first show that the associated nonlinear Fokker-Planck equation has a unique weak solution, using the Crandall-Liggett theorem for differential equations in Banach spaces. From this solution to the Fokker-Planck equation, we construct a unique solution to the ODE, relying on Trevisan's superposition principle. The convergence of the induced gradient flow to the data distribution is obtained by analyzing the Fokker-Planck equation.  ( 2 min )
    Geodesics, Non-linearities and the Archive of Novelty Search. (arXiv:2205.03162v1 [cs.LG])
    The Novelty Search (NS) algorithm was proposed more than a decade ago. However, the mechanisms behind its empirical success are still not well formalized or understood. This short note focuses on the effects of the archive on exploration. Experimental evidence from a few application domains suggests that archive-based NS generally performs better than when novelty is computed solely with respect to the population. An argument often encountered in the literature is that the archive prevents exploration from backtracking or cycling, i.e., from revisiting previously encountered areas in the behavior space. We argue that this is not a complete or accurate explanation, as backtracking (besides often being desirable) can actually be enabled by the archive. Through low-dimensional/analytical examples, we show that a key effect of the archive is that it counterbalances the exploration biases that result, among other factors, from the use of inadequate behavior metrics and the non-linearities of the behavior mapping. Our observations hint that attributing a more active role to the archive in sampling can be beneficial.  ( 2 min )
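    For concreteness, the standard archive-based novelty score under discussion is the mean distance of a behavior descriptor to its k nearest neighbors in the union of the current population and the archive; a minimal NumPy sketch:

        import numpy as np

        def novelty(b, population, archive, k: int = 5):
            pool = np.vstack([population, archive]) if len(archive) else population
            dists = np.linalg.norm(pool - b, axis=1)
            return float(np.sort(dists)[:k].mean())   # mean k-NN distance

        rng = np.random.default_rng(0)
        population = rng.normal(size=(20, 2))          # behavior descriptors
        archive = rng.normal(size=(50, 2))
        print(novelty(np.zeros(2), population, archive))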
    Real Time On Sensor Gait Phase Detection with 0.5KB Deep Learning Model. (arXiv:2205.03234v1 [eess.SP])
    Gait phase detection with convolutional neural networks provides accurate classification but demands a high computational cost, which inhibits real-time, low-power, on-sensor processing. This paper presents segmentation-based gait phase detection with a width- and depth-downscaled U-Net-like model that needs only 0.5 KB of model size and 67K operations per second, reaches 95.9% accuracy, and fits easily on a resource-limited on-sensor microcontroller.  ( 2 min )
    Sound2Synth: Interpreting Sound via FM Synthesizer Parameters Estimation. (arXiv:2205.03043v1 [cs.SD])
    A synthesizer is a type of electronic musical instrument now widely used in modern music production and sound design. Each parameter configuration of a synthesizer produces a unique timbre and can be viewed as a unique instrument. Estimating the parameter configuration that best restores a given sound timbre is an important yet complicated task: the synthesizer parameter estimation problem. We propose a multi-modal deep-learning-based pipeline, Sound2Synth, together with a network structure, Prime-Dilated Convolution (PDC), specially designed to solve this problem. Our method achieves not only state-of-the-art but also the first real-world-applicable results on the Dexed synthesizer, a popular FM synthesizer.  ( 2 min )
    New-Onset Diabetes Assessment Using Artificial Intelligence-Enhanced Electrocardiography. (arXiv:2205.02900v1 [cs.LG])
    Undiagnosed diabetes is present in 21.4% of adults with diabetes. Diabetes can remain asymptomatic and undetected due to limitations in screening rates. To address this issue, questionnaires such as the American Diabetes Association (ADA) Risk test have been recommended for use by physicians and the public. Based on evidence that blood glucose concentration can affect cardiac electrophysiology, we hypothesized that an artificial intelligence (AI)-enhanced electrocardiogram (ECG) could identify adults with new-onset diabetes. We trained a neural network to estimate HbA1c using a 12-lead ECG and readily available demographics. We retrospectively assembled a dataset comprised of patients with paired ECG and HbA1c data. The population of patients who receive both an ECG and an HbA1c test may be a biased sample of the complete outpatient population, so we adjusted the importance placed on each patient to generate a more representative pseudo-population. We found that ECG-based assessment outperforms the ADA Risk test, achieving a higher area under the curve (0.80 vs. 0.68) and positive predictive value (14% vs. 9%) -- 2.6 times the prevalence of diabetes in the cohort. The AI-enhanced ECG significantly outperforms electrophysiologist interpretation of the ECG, suggesting that the task is beyond current clinical capabilities. Given the prevalence of ECGs in clinics and via wearable devices, such a tool would make precise, automated diabetes assessment widely accessible.  ( 2 min )
    The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations. (arXiv:2205.03295v1 [cs.LG])
    Machine learning models in safety-critical settings like healthcare are often blackboxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods where a simple, human-interpretable model imitates the behavior of these blackbox models are often proposed to help users trust model predictions. In this work, we audit the quality of such explanations for different protected subgroups using real data from four settings in finance, healthcare, college admissions, and the US justice system. Across two different blackbox model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups. We also demonstrate that pairing explainability methods with recent advances in robust machine learning can improve explanation fairness in some settings. However, we highlight the importance of communicating details of non-zero fidelity gaps to users, since a single solution might not exist across all settings. Finally, we discuss the implications of unfair explanation models as a challenging and understudied problem facing the machine learning community.  ( 2 min )
    Semi-Supervised Imitation Learning of Team Policies from Suboptimal Demonstrations. (arXiv:2205.02959v1 [cs.AI])
    We present Bayesian Team Imitation Learner (BTIL), an imitation learning algorithm to model behavior of teams performing sequential tasks in Markovian domains. In contrast to existing multi-agent imitation learning techniques, BTIL explicitly models and infers the time-varying mental states of team members, thereby enabling learning of decentralized team policies from demonstrations of suboptimal teamwork. Further, to allow for sample- and label-efficient policy learning from small datasets, BTIL employs a Bayesian perspective and is capable of learning from semi-supervised demonstrations. We demonstrate and benchmark the performance of BTIL on synthetic multi-agent tasks as well as a novel dataset of human-agent teamwork. Our experiments show that BTIL can successfully learn team policies from demonstrations despite the influence of team members' (time-varying and potentially misaligned) mental states on their behavior.  ( 2 min )
    Explainable multi-class anomaly detection on functional data. (arXiv:2205.02935v1 [stat.ML])
    In this paper we describe an approach for anomaly detection in multivariate functional data and for explaining its results. The detection procedure consists of transforming the series into a vector of features and using an Isolation Forest algorithm. The explanation procedure is based on the computation of SHAP coefficients and on the use of a supervised decision tree. We apply the approach to simulated data to measure the performance of our method, and to real data coming from industry.  ( 2 min )
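    A minimal sketch of the pipeline, assuming scikit-learn and the shap package (and that shap's TreeExplainer supports sklearn's IsolationForest, as recent versions do); the four summary features below stand in for the paper's feature transform, which the abstract does not enumerate:

        import numpy as np
        import shap
        from sklearn.ensemble import IsolationForest

        rng = np.random.default_rng(0)
        series = rng.normal(size=(100, 50))        # 100 curves, 50 time points
        series[:5] += 4.0                          # a few shifted, anomalous curves

        # One feature vector per curve: mean, std, min, max (illustrative only).
        X = np.column_stack([series.mean(1), series.std(1),
                             series.min(1), series.max(1)])

        forest = IsolationForest(random_state=0).fit(X)
        explainer = shap.TreeExplainer(forest)
        shap_values = explainer.shap_values(X[:5])  # per-feature attributions
        print(shap_values.shape)                    # (5, 4)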
    Understanding Urban Water Consumption using Remotely Sensed Data. (arXiv:2205.02932v1 [cs.CV])
    Urban metabolism is an active field of research that deals with estimating emissions and resource consumption in urban regions. The analysis could be carried out through a manual survey or by implementing machine learning algorithms. In this exploratory work, we estimate the water consumption of the buildings in a region captured by satellite imagery. To this end, we break our analysis into three parts: i) identification of building pixels, given a satellite image; ii) identification of the building type (residential/non-residential) from the building pixels; and iii) use of the building pixels along with their type to estimate water consumption, using the average per-unit-area consumption for different building types as obtained from municipal surveys.  ( 2 min )
    Multi-confound regression adversarial network for deep learning-based diagnosis on highly heterogenous clinical data. (arXiv:2205.02885v1 [cs.LG])
    Automated disease detection in medical images using deep learning holds promise to improve the diagnostic ability of radiologists, but routinely collected clinical data frequently contain technical and demographic confounding factors that differ between hospitals, negatively affecting the robustness of diagnostic deep learning models. Thus, there is a critical need for deep learning models that can train on imbalanced datasets without overfitting to site-specific confounding factors. In this work, we developed a novel deep learning architecture, MUCRAN (Multi-Confound Regression Adversarial Network), to train a deep learning model on highly heterogeneous clinical data while regressing out demographic and technical confounding factors. We trained MUCRAN using 16,821 clinical T1 axial brain MRIs collected from Massachusetts General Hospital before 2019 and tested it using post-2019 data to distinguish Alzheimer's disease (AD) patients, identified using both prescriptions of AD drugs and ICD codes, from a non-medicated control group. In external validation tests using MRI data from other hospitals, the model showed robust performance of over 90% accuracy on newly collected data. This work shows the feasibility of deep learning-based diagnosis in real-world clinical data.  ( 2 min )
    Over-The-Air Federated Learning under Byzantine Attacks. (arXiv:2205.02949v1 [cs.LG])
    Federated learning (FL) is a promising solution to enable many AI applications, where sensitive datasets from distributed clients are needed for collaboratively training a global model. FL allows the clients to participate in the training phase, governed by a central server, without sharing their local data. One of the main challenges of FL is the communication overhead, where the model updates of the participating clients are sent to the central server at each global training round. Over-the-air computation (AirComp) has been recently proposed to alleviate the communication bottleneck where the model updates are sent simultaneously over the multiple-access channel. However, simple averaging of the model updates via AirComp makes the learning process vulnerable to random or intended modifications of the local model updates of some Byzantine clients. In this paper, we propose a transmission and aggregation framework to reduce the effect of such attacks while preserving the benefits of AirComp for FL. For the proposed robust approach, the central server divides the participating clients randomly into groups and allocates a transmission time slot for each group. The updates of the different groups are then aggregated using a robust aggregation technique. We extend our approach to handle the case of non-i.i.d. local data, where a resampling step is added before robust aggregation. We analyze the convergence of the proposed approach for both cases of i.i.d. and non-i.i.d. data and demonstrate that the proposed algorithm converges at a linear rate to a neighborhood of the optimal solution. Experiments on real datasets are provided to confirm the robustness of the proposed approach.  ( 2 min )
    Quantification of Robotic Surgeries with Vision-Based Deep Learning. (arXiv:2205.03028v1 [cs.RO])
    Surgery is a high-stakes domain where surgeons must navigate critical anatomical structures and actively avoid potential complications while achieving the main task at hand. Such surgical activity has been shown to affect long-term patient outcomes. To better understand this relationship, whose mechanics remain unknown for the majority of surgical procedures, we hypothesize that the core elements of surgery must first be quantified in a reliable, objective, and scalable manner. We believe this is a prerequisite for the provision of surgical feedback and modulation of surgeon performance in pursuit of improved patient outcomes. To holistically quantify surgeries, we propose a unified deep learning framework, entitled Roboformer, which operates exclusively on videos recorded during surgery to independently achieve multiple tasks: surgical phase recognition (the what of surgery), gesture classification and skills assessment (the how of surgery). We validated our framework on four video-based datasets of two commonly-encountered types of steps (dissection and suturing) within minimally-invasive robotic surgeries. We demonstrated that our framework can generalize well to unseen videos, surgeons, medical centres, and surgical procedures. We also found that our framework, which naturally lends itself to explainable findings, identified relevant information when achieving a particular task. These findings are likely to instill surgeons with more confidence in our framework's behaviour, increasing the likelihood of clinical adoption, and thus paving the way for more targeted surgical feedback.  ( 2 min )
    Low Dimensional Invariant Embeddings for Universal Geometric Learning. (arXiv:2205.02956v1 [cs.LG])
    This paper studies separating invariants: mappings on $d$-dimensional semi-algebraic subsets of $D$-dimensional Euclidean domains which are invariant to semi-algebraic group actions and separate orbits. The motivation for this study comes from the usefulness of separating invariants in proving universality of equivariant neural network architectures. We observe that in several cases the cardinality of separating invariants proposed in the machine learning literature is much larger than the ambient dimension $D$. As a result, the theoretical universal constructions based on these separating invariants are unrealistically large. Our goal in this paper is to resolve this issue. We show that when a continuous family of semi-algebraic separating invariants is available, separation can be obtained by randomly selecting $2d+1$ of these invariants. We apply this methodology to obtain an efficient scheme for computing separating invariants for several classical group actions which have been studied in the invariant learning literature. Examples include matrix multiplication actions on point clouds by permutations, rotations, and various other linear groups.  ( 2 min )
    Large Scale Transfer Learning for Differentially Private Image Classification. (arXiv:2205.02973v1 [cs.LG])
    Differential Privacy (DP) provides a formal framework for training machine learning models with individual-example-level privacy. Training models with DP protects the model against leakage of sensitive data in a potentially adversarial setting. In the field of deep learning, Differentially Private Stochastic Gradient Descent (DP-SGD) has emerged as a popular private training algorithm. Private training using DP-SGD protects against leakage by injecting noise into individual example gradients, such that the trained model weights become nearly independent of the use of any particular training example. While this result is quite appealing, the computational cost of training large-scale models with DP-SGD is substantially higher than for non-private training. This is further exacerbated by the fact that increasing the number of parameters leads to larger degradation in utility with DP. In this work, we zoom in on the ImageNet dataset and demonstrate that, as in the non-private case, pre-training over-parameterized models on a large public dataset can lead to substantial gains when the model is finetuned privately. Moreover, by systematically comparing private and non-private models across a range of huge batch sizes, we find that, as in the non-private setting, the choice of optimizer can further improve performance substantially with DP. By switching from DP-SGD to DP-LAMB we saw improvements of up to 20 percentage points (absolute). Finally, we show that finetuning just the last layer for a \emph{single step} in the full-batch setting leads to SOTA results of 81.7% under a wide privacy budget range of $\epsilon \in [4, 10]$ and $\delta = 10^{-6}$, while minimizing the computational overhead substantially.  ( 2 min )
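    For reference, the core DP-SGD update the paper builds on clips each per-example gradient to norm C and adds Gaussian noise before averaging; a minimal NumPy sketch (the hyperparameter values are illustrative):

        import numpy as np

        def dp_sgd_step(w, per_example_grads, lr=0.1, C=1.0, sigma=1.0, rng=None):
            rng = rng or np.random.default_rng(0)
            # Clip each per-example gradient to L2 norm at most C.
            clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
                       for g in per_example_grads]
            # Add Gaussian noise scaled to the clipping norm, then average.
            noise = rng.normal(scale=sigma * C, size=w.shape)
            g_priv = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
            return w - lr * g_priv

        w = np.zeros(4)
        grads = [np.random.default_rng(i).normal(size=4) for i in range(32)]
        print(dp_sgd_step(w, grads))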
    Evaluating Context for Deep Object Detectors. (arXiv:2205.02887v1 [cs.CV])
    Which object detector is suitable for your context-sensitive task? Deep object detectors exploit scene context for recognition differently. In this paper, we group object detectors into three categories in terms of context use: no context by cropping the input (RCNN), partial context by cropping the feature map (two-stage methods), and full context without any cropping (single-stage methods). We systematically evaluate the effect of context for each deep detector category. We create a fully controlled dataset with varying context and investigate how deep detectors use it. We also evaluate gradually removing the background context and the foreground object on MS COCO. We demonstrate that single-stage and two-stage object detectors can and will use context by virtue of their large receptive fields. Thus, choosing the best object detector may depend on the application context.  ( 2 min )
    Learn-to-Race Challenge 2022: Benchmarking Safe Learning and Cross-domain Generalisation in Autonomous Racing. (arXiv:2205.02953v1 [cs.RO])
    We present the results of our autonomous racing virtual challenge, based on the newly-released Learn-to-Race (L2R) simulation framework, which seeks to encourage interdisciplinary research in autonomous driving and to help advance the state of the art on a realistic benchmark. Analogous to racing being used to test cutting-edge vehicles, we envision autonomous racing to serve as a particularly challenging proving ground for autonomous agents as: (i) they need to make sub-second, safety-critical decisions in a complex, fast-changing environment; and (ii) both perception and control must be robust to distribution shifts, novel road features, and unseen obstacles. Thus, the main goal of the challenge is to evaluate the joint safety, performance, and generalisation capabilities of reinforcement learning agents on multi-modal perception, through a two-stage process. In the first stage of the challenge, we evaluate an autonomous agent's ability to drive as fast as possible, while adhering to safety constraints. In the second stage, we additionally require the agent to adapt to an unseen racetrack through safe exploration. In this paper, we describe the new L2R Task 2.0 benchmark, with refined metrics and baseline approaches. We also provide an overview of deployment, evaluation, and rankings for the inaugural instance of the L2R Autonomous Racing Virtual Challenge (supported by Carnegie Mellon University, Arrival Ltd., AICrowd, Amazon Web Services, and Honda Research), which officially used the new L2R Task 2.0 benchmark and received over 20,100 views, 437 active participants, 46 teams, and 733 model submissions -- from 88 unique institutions, in 28 different countries. Finally, we release leaderboard results from the challenge and provide description of the two top-ranking approaches in cross-domain model transfer, across multiple sensor configurations and simulated races.  ( 3 min )
    Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes. (arXiv:2205.02904v1 [cs.GR])
    This paper introduces a framework designed to accurately predict piecewise linear mappings of arbitrary meshes via a neural network, enabling training and evaluating over heterogeneous collections of meshes that do not share a triangulation, as well as producing highly detail-preserving maps whose accuracy exceeds current state of the art. The framework is based on reducing the neural aspect to a prediction of a matrix for a single given point, conditioned on a global shape descriptor. The field of matrices is then projected onto the tangent bundle of the given mesh, and used as candidate jacobians for the predicted map. The map is computed by a standard Poisson solve, implemented as a differentiable layer with cached pre-factorization for efficient training. This construction is agnostic to the triangulation of the input, thereby enabling applications on datasets with varying triangulations. At the same time, by operating in the intrinsic gradient domain of each individual mesh, it allows the framework to predict highly-accurate mappings. We validate these properties by conducting experiments over a broad range of scenarios, from semantic ones such as morphing, registration, and deformation transfer, to optimization-based ones, such as emulating elastic deformations and contact correction, as well as being the first work, to our knowledge, to tackle the task of learning to compute UV parameterizations of arbitrary meshes. The results exhibit the high accuracy of the method as well as its versatility, as it is readily applied to the above scenarios without any changes to the framework.  ( 2 min )
    Putting Density Functional Theory to the Test in Machine-Learning-Accelerated Materials Discovery. (arXiv:2205.02967v1 [cond-mat.mtrl-sci])
    Accelerated discovery with machine learning (ML) has begun to provide the advances in efficiency needed to overcome the combinatorial challenge of computational materials design. Nevertheless, ML-accelerated discovery both inherits the biases of training data derived from density functional theory (DFT) and leads to many attempted calculations that are doomed to fail. Many compelling functional materials and catalytic processes involve strained chemical bonds, open-shell radicals and diradicals, or metal-organic bonds to open-shell transition-metal centers. Although promising targets, these materials present unique challenges for electronic structure methods and combinatorial challenges for their discovery. In this Perspective, we describe the advances needed in accuracy, efficiency, and approach beyond what is typical in conventional DFT-based ML workflows. These challenges have begun to be addressed through ML models trained to predict the results of multiple methods or the differences between them, enabling quantitative sensitivity analysis. For DFT to be trusted for a given data point in a high-throughput screen, it must pass a series of tests. ML models that predict the likelihood of calculation success and detect the presence of strong correlation will enable rapid diagnoses and adaptation strategies. These "decision engines" represent the first steps toward autonomous workflows that avoid the need for expert determination of the robustness of DFT-based materials discoveries.  ( 2 min )
    GreenDB: Toward a Product-by-Product Sustainability Database. (arXiv:2205.02908v1 [cs.LG])
    The production, shipping, usage, and disposal of consumer goods have a substantial impact on greenhouse gas emissions and the depletion of resources. Modern retail platforms rely heavily on Machine Learning (ML) for their search and recommender systems. Thus, ML can potentially support efforts towards more sustainable consumption patterns, for example by accounting for sustainability aspects in product search or recommendations. However, leveraging ML's potential for reaching sustainability goals requires data on sustainability. Unfortunately, no open and publicly available database integrates sustainability information on a product-by-product basis. In this work, we present GreenDB, which fills this gap. Based on the search logs of millions of users, we prioritize the products users care about most. The GreenDB schema extends the well-known schema.org Product definition and can be readily integrated into existing product catalogs to improve the sustainability information available for search and recommendation experiences. We present a proof-of-concept implementation of a scraping system that creates the GreenDB dataset.  ( 2 min )
    Lagrangian PINNs: A causality-conforming solution to failure modes of physics-informed neural networks. (arXiv:2205.02902v1 [cs.LG])
    Physics-informed neural networks (PINNs) leverage neural networks to find the solutions of partial differential equation (PDE)-constrained optimization problems with initial and boundary conditions as soft constraints. These soft constraints are often considered to be the sources of the complexity in the training phase of PINNs. Here, we demonstrate that the challenge of training (i) persists even when the boundary conditions are strictly enforced, and (ii) is closely related to the Kolmogorov n-width associated with problems exhibiting transport, convection, traveling waves, or moving fronts. Given this realization, we describe the mechanism underlying training schemes such as those used in eXtended PINNs (XPINN), curriculum regularization, and sequence-to-sequence learning. For an important category of PDEs, namely those governed by the non-linear convection-diffusion equation, we propose reformulating PINNs in a Lagrangian frame of reference, i.e., LPINNs, as a PDE-informed solution. A parallel architecture with two branches is proposed: one branch solves for the state variables on the characteristics, and the second branch solves for the low-dimensional characteristic curves. The proposed architecture conforms to the causality innate to the convection and leverages the direction of travel of the information in the domain. Finally, we demonstrate that the loss landscapes of LPINNs are less sensitive to the so-called "complexity" of the problems than those of traditional PINNs in the Eulerian framework.  ( 2 min )
    Understanding Transfer Learning for Chest Radiograph Clinical Report Generation with Modified Transformer Architectures. (arXiv:2205.02841v1 [eess.IV])
    The image captioning task is increasingly prevalent in artificial intelligence applications for medicine. One important application is clinical report generation from chest radiographs. The clinical writing of unstructured reports is time consuming and error-prone; an automated system would improve standardization, reduce errors and time spent, and broaden medical accessibility. In this paper we demonstrate the importance of domain-specific pre-training and propose a modified transformer architecture for the medical image captioning task. To accomplish this, we train a series of modified transformers to generate clinical reports from chest radiograph image input. These modified transformers include: a meshed-memory augmented transformer architecture with a visual extractor using ImageNet pre-trained weights, a meshed-memory augmented transformer architecture with a visual extractor using CheXpert pre-trained weights, and a meshed-memory augmented transformer whose encoder is passed the concatenated embeddings from both ImageNet pre-trained and CheXpert pre-trained weights. We use BLEU(1-4), ROUGE-L, CIDEr, and clinical CheXbert F1 scores to validate our models and demonstrate scores competitive with state-of-the-art models. We provide evidence that ImageNet pre-training is ill-suited for the medical image captioning task, especially for less frequent conditions (e.g., enlarged cardiomediastinum, lung lesion, pneumothorax). Furthermore, we demonstrate that the double-feature model improves performance for specific medical conditions (edema, consolidation, pneumothorax, support devices) and the overall CheXbert F1 score, and should be developed further in future work. Such a double-feature model, including both ImageNet and domain-specific pre-training, could be used in a wide range of image captioning models in medicine.  ( 2 min )
    Exploiting Ligand Additivity for Transferable Machine Learning of Multireference Character Across Known Transition Metal Complex Ligands. (arXiv:2205.02879v1 [cond-mat.mtrl-sci])
    Accurate virtual high-throughput screening (VHTS) of transition metal complexes (TMCs) remains challenging due to the possibility of high multi-reference (MR) character that complicates property evaluation. We compute MR diagnostics for over 5,000 ligands present in previously synthesized transition metal complexes in the Cambridge Structural Database (CSD). To accomplish this task, we introduce an iterative approach for consistent ligand charge assignment for ligands in the CSD. Across this set, we observe that MR character correlates linearly with the inverse value of the averaged bond order over all bonds in the molecule. We then demonstrate that ligand additivity of MR character holds in TMCs, which suggests that the TMC MR character can be inferred from the sum of the MR character of the ligands. Encouraged by this observation, we leverage ligand additivity and develop a ligand-derived machine learning representation to train neural networks to predict the MR character of TMCs from properties of the constituent ligands. This approach yields models with excellent performance and superior transferability to unseen ligand chemistry and compositions.  ( 2 min )
    Generative Adversarial Network Based Synthetic Learning and a Novel Domain Relevant Loss Term for Spine Radiographs. (arXiv:2205.02843v1 [eess.IV])
    Problem: There is a lack of big data for the training of deep learning models in medicine, owing to the time cost of data collection and privacy concerns. Generative adversarial networks (GANs) offer both the potential to generate new data and the possibility of using this newly generated data, without inclusion of patients' real data, for downstream applications. Approach: A series of GANs were trained and applied to a downstream computer vision spine radiograph abnormality classification task. Separate classifiers were trained with either access or no access to the original imaging. Trained GANs included a conditional StyleGAN2 with adaptive discriminator augmentation, a conditional StyleGAN2 with adaptive discriminator augmentation to generate spine radiographs conditional on lesion type, and, using a novel clinical loss term for the generator, a StyleGAN2 with adaptive discriminator augmentation conditional on abnormality (SpineGAN). Finally, a differential-privacy-imposed StyleGAN2 with adaptive discriminator augmentation conditional on abnormality was trained, and an ablation study was performed on its differential privacy impositions. Key Results: To the best of our knowledge from a literature review, we are the first to accomplish GAN generation of synthetic spine radiographs without meaningful input. We further demonstrate the success of synthetic learning for the spine domain with a downstream clinical classification task (AUC of 0.830 using synthetic data compared to AUC of 0.886 using the real data). Importantly, the introduction of a new clinical loss term for the generator was found to increase generation recall as well as accelerate model training. Lastly, we demonstrate that, in a limited-size medical dataset, differential privacy impositions severely impede GAN training, finding that this is specifically due to the requirement for gradient perturbation with noise.  ( 2 min )
    AdaTriplet: Adaptive Gradient Triplet Loss with Automatic Margin Learning for Forensic Medical Image Matching. (arXiv:2205.02849v1 [eess.IV])
    This paper tackles the challenge of forensic medical image matching (FMIM) using deep neural networks (DNNs). FMIM is a particular case of content-based image retrieval (CBIR). The main challenge in FMIM, compared to the general case of CBIR, is that the subject to whom a query image belongs may be affected by aging and progressive degenerative disorders, making it difficult to match data on a subject level. CBIR with DNNs is generally solved by minimizing a ranking loss, such as the Triplet loss (TL), computed on image representations extracted by a DNN from the original data. TL, in particular, operates on triplets: anchor, positive (similar to the anchor) and negative (dissimilar to the anchor). Although TL has been shown to perform well in many CBIR tasks, it still has limitations, which we identify and analyze in this work. In this paper, we introduce (i) the AdaTriplet loss -- an extension of TL whose gradients adapt to different difficulty levels of negative samples, and (ii) the AutoMargin method -- a technique to dynamically adjust the hyperparameters of margin-based losses such as TL and our proposed loss. Our results are evaluated on two large-scale benchmarks for FMIM based on the Osteoarthritis Initiative and Chest X-ray-14 datasets. The code allowing replication of this study has been made publicly available at \url{https://github.com/Oulu-IMEDS/AdaTriplet}.  ( 2 min )
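    For orientation, below is a minimal sketch of the vanilla Triplet loss that AdaTriplet extends. This is not the AdaTriplet loss itself: the margin here is a fixed hyperparameter, whereas the paper's AutoMargin adjusts it dynamically and AdaTriplet adapts gradients to negative-sample difficulty.

```python
# Minimal sketch of the standard Triplet loss (the baseline AdaTriplet extends).
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Vanilla triplet loss on embedding batches."""
    d_ap = F.pairwise_distance(anchor, positive)   # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)   # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()     # hinge: push d_an > d_ap + margin

# toy usage with random 128-d embeddings for a batch of 16 triplets
a, p, n = (F.normalize(torch.randn(16, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n))
```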
  • Open

    On boundary conditions parametrized by analytic functions. (arXiv:2205.03185v1 [cs.LG])
    Computer algebra can answer various questions about partial differential equations using symbolic algorithms. However, the inclusion of data into equations is rare in computer algebra. Therefore, recently, computer algebra models have been combined with Gaussian processes, a regression model in machine learning, to describe the behavior of certain differential equations under data. While it was possible to describe polynomial boundary conditions in this context, we extend these models to analytic boundary conditions. Additionally, we describe the necessary algorithms for Gr\"obner and Janet bases of Weyl algebras with certain analytic coefficients. Using these algorithms, we provide examples of divergence-free flow in domains bounded by analytic functions and adapted to observations.  ( 2 min )
    Tensor Principal Component Analysis in High Dimensional CP Models. (arXiv:2108.04428v3 [stat.ML] UPDATED)
    The CP decomposition for high dimensional non-orthogonal spiked tensors is an important problem with broad applications across many disciplines. However, previous works with theoretical guarantees typically assume restrictive incoherence conditions on the basis vectors for the CP components. In this paper, we propose new computationally efficient composite PCA and concurrent orthogonalization algorithms for tensor CP decomposition with theoretical guarantees under mild incoherence conditions. The composite PCA applies the principal component or singular value decompositions twice, first to a matrix unfolding of the tensor data to obtain singular vectors and then to the matrix folding of the singular vectors obtained in the first step. It can be used as an initialization for any iterative optimization schemes for the tensor CP decomposition. The concurrent orthogonalization algorithm iteratively estimates the basis vector in each mode of the tensor by simultaneously applying projections to the orthogonal complements of the spaces generated by other CP components in other modes. It is designed to improve the alternating least squares estimator and other forms of the high order orthogonal iteration for tensors with low or moderately high CP ranks, and it is guaranteed to converge rapidly when the error of any given initial estimator is bounded by a small constant. Our theoretical investigation provides estimation accuracy and convergence rates for the two proposed algorithms. Both proposed algorithms are applicable to a deterministic tensor, its noisy version, and the order-$2K$ covariance tensor of order-$K$ tensor data in a factor model with uncorrelated factors. Our implementations on synthetic data demonstrate significant practical superiority of our approach over existing methods.  ( 2 min )
    Gaussian Processes for Missing Value Imputation. (arXiv:2204.04648v2 [stat.ML] UPDATED)
    Missing values are common in many real-life datasets. However, most current machine learning methods cannot handle missing values, which means that they must be imputed beforehand. Gaussian Processes (GPs) are non-parametric models with accurate uncertainty estimates that, combined with sparse approximations and stochastic variational inference, scale to large datasets. Sparse GPs can be used to compute a predictive distribution for missing data. Here, we present a hierarchical composition of sparse GPs that is used to predict missing values at each dimension using all the variables from the other dimensions. We call the approach missing GP (MGP). MGP can be trained simultaneously to impute all observed missing values. Specifically, it outputs a predictive distribution for each missing value that is then used in the imputation of other missing values. We evaluate MGP on one private clinical dataset and four UCI datasets with different percentages of missing values. We compare the performance of MGP with other state-of-the-art methods for imputing missing values, including variants based on sparse GPs and deep GPs. The results obtained show a significantly better performance of MGP.  ( 2 min )
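    A toy sketch of the underlying idea (impute each dimension from the others with a GP). The paper uses sparse variational GPs trained jointly; the version below uses exact scikit-learn GPs fitted independently per column, purely for illustration.

```python
# Illustrative per-column GP imputation; NOT the paper's joint sparse-GP model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan                    # 10% missing at random

X_imp = np.where(np.isnan(X), np.nanmean(X, axis=0), X)  # crude mean init
for d in range(X.shape[1]):
    miss = np.isnan(X[:, d])
    if not miss.any():
        continue
    others = np.delete(X_imp, d, axis=1)                 # all other dimensions as inputs
    gp = GaussianProcessRegressor().fit(others[~miss], X[~miss, d])
    mean, std = gp.predict(others[miss], return_std=True)
    X_imp[miss, d] = mean                                # plug in predictive mean
```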
    Bayesian Sample Size Prediction for Online Activity. (arXiv:2111.12157v2 [stat.ML] UPDATED)
    In many contexts it is useful to predict the number of individuals in some population who will initiate a particular activity during a given period: for example, the number of users who will install a software update, or the number of customers who will use a new feature on a website or participate in an A/B test. In practical settings, there is heterogeneity amongst individuals with regard to the distribution of time until they will initiate. For these reasons it is inappropriate to assume that the number of new individuals observed on successive days will be identically distributed. Given observations on the number of unique users participating in an initial period, we present a simple but novel Bayesian method for predicting the number of additional individuals who will participate during a subsequent period. We illustrate the performance of the method in predicting sample size in online experimentation.  ( 2 min )
    DADApy: Distance-based Analysis of DAta-manifolds in Python. (arXiv:2205.03373v1 [cs.LG])
    DADApy is a Python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering, and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. The package is freely available under the open-source Apache 2.0 license and can be downloaded from the GitHub page https://github.com/sissa-data-science/DADApy.  ( 2 min )
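    Rather than quote the package API from memory, here is a from-scratch sketch of the TWO-NN intrinsic-dimension estimator (Facco et al., 2017), one of the estimators this family of tools implements; for real use, prefer DADApy itself.

```python
# From-scratch TWO-NN intrinsic-dimension estimate; illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """MLE intrinsic dimension from ratios of 2nd to 1st NN distances."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]       # ratio r2/r1 (column 0 is the point itself)
    return len(mu) / np.sum(np.log(mu))  # maximum-likelihood estimate of d

# 3-d Gaussian blob linearly embedded in 10-d space: estimate should be near 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 10))
print(twonn_id(X))
```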
    Designing Robust Biotechnological Processes Regarding Variabilities using Multi-Objective Optimization Applied to a Biopharmaceutical Seed Train Design. (arXiv:2205.03261v1 [cs.LG])
    Development and optimization of biopharmaceutical production processes with cell cultures is cost- and time-consuming and often performed rather empirically. Efficient optimization of multiple objectives such as process time, viable cell density, number of operating steps & cultivation scales, required medium, amount of product, and product quality represents a promising approach. This contribution presents a workflow which couples uncertainty-based upstream simulation and Bayesian optimization using Gaussian processes. Its application is demonstrated in a simulation case study for a relevant industrial task in process development, the design of a robust cell culture expansion process (seed train), meaning that despite uncertainties and variabilities concerning cell growth, low variation of viable cell density during the seed train is obtained. Compared to a non-optimized reference seed train, the optimized process showed much lower deviation rates regarding viable cell densities (below ~10% instead of 41.7%) using 5 or 4 shake flask scales, and seed train duration could be reduced by 56 h from 576 h to 520 h. Overall, it is shown that applying Bayesian optimization allows for the optimization of a multi-objective function with several optimizable input variables, under a considerable number of constraints, with low computational effort. This approach has the potential to be used as a decision tool, e.g. for the choice of an optimal and robust seed train design, or for further optimization tasks within process development.  ( 2 min )
    Learning Optimal Conformal Classifiers. (arXiv:2110.09192v3 [cs.LG] UPDATED)
    Modern deep-learning-based classifiers show very high accuracy on test data, but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training, with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows one to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.  ( 2 min )
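    For context, a sketch of standard post-hoc split conformal prediction for classification, i.e., the procedure that ConfTr "simulates" differentiably during training. The array names are illustrative assumptions.

```python
# Split conformal prediction for classification (post-hoc baseline, not ConfTr).
import numpy as np

def conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    n = len(y_cal)
    # nonconformity score: 1 - probability assigned to the true class
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    # finite-sample-corrected quantile of calibration scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # prediction set: every class whose score falls below the threshold
    return probs_test >= 1.0 - q   # boolean mask, one row per test point

# sets contain the true class with probability >= 1 - alpha (marginally)
```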
    Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization. (arXiv:2205.03059v1 [cs.LG])
    Nonconvex regularization has been popularly used in low-rank matrix learning. However, extending it for low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for use with a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm can avoid expensive tensor folding/unfolding operations. A special "sparse plus low-rank" structure is maintained throughout the iterations, and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-{\L}ojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world data sets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state-of-the-art.  ( 2 min )
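    For intuition on the proximal steps mentioned in the abstract, here is the classical proximal operator of the (convex) nuclear norm via singular-value soft-thresholding; nonconvex regularizers modify the shrinkage rule, and the paper's contribution lies in avoiding the expensive tensor folds/unfolds around such steps.

```python
# Proximal operator of the convex nuclear norm (SVD soft-thresholding).
import numpy as np

def prox_nuclear(M, tau):
    """argmin_X 0.5*||X - M||_F^2 + tau*||X||_* via SVD soft-thresholding."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # soft-threshold the singular values
    return (U * s_shrunk) @ Vt            # rebuild with the shrunken spectrum

M = np.random.default_rng(0).normal(size=(50, 40))
X = prox_nuclear(M, tau=2.0)
print(np.linalg.matrix_rank(X))           # typically lower rank than M
```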
    Probabilistic learning constrained by realizations using a weak formulation of Fourier transform of probability measures. (arXiv:2205.03078v1 [stat.ML])
    This paper deals with taking into account a given set of realizations as constraints in the Kullback-Leibler minimum principle, which is used as a probabilistic learning algorithm. This permits the effective integration of data into predictive models. We consider the probabilistic learning of a random vector that is made up of either a quantity of interest (QoI) (unsupervised case) or the couple of the QoI and a control parameter (supervised case). A training set of independent realizations of this random vector is assumed to be given and to be generated with a prior probability measure that is unknown. A target set of realizations of the QoI is available for the two considered cases. The framework is that of non-Gaussian problems in high dimension. A functional approach is developed on the basis of a weak formulation of the Fourier transform of probability measures (characteristic functions). The construction makes it possible to take into account the target set of realizations of the QoI in the Kullback-Leibler minimum principle. The proposed approach allows for estimating the posterior probability measure of the QoI (unsupervised case) or the posterior joint probability measure of the QoI with the control parameter (supervised case). The existence and uniqueness of the posterior probability measure are analyzed for the two cases. The numerical aspects are detailed in order to facilitate the implementation of the proposed method. The presented application in high dimension demonstrates the efficiency and robustness of the proposed algorithm.  ( 2 min )
    Benchmarking Econometric and Machine Learning Methodologies in Nowcasting. (arXiv:2205.03318v1 [stat.ML])
    Nowcasting can play a key role in giving policymakers timelier insight into data published with a significant time lag, such as final GDP figures. Currently, there is a plethora of methodologies and approaches for practitioners to choose from. However, a comprehensive comparison of these disparate approaches in terms of predictive performance and characteristics has been lacking. This paper addresses that deficiency by examining the performance of 12 different methodologies in nowcasting US quarterly GDP growth, including all the methods most commonly employed in nowcasting as well as some of the most popular traditional machine learning approaches. Performance was assessed on three different tumultuous periods in US economic history: the early 1980s recession, the 2008 financial crisis, and the COVID crisis. The two best-performing methodologies in the analysis were long short-term memory artificial neural networks (LSTM) and Bayesian vector autoregression (BVAR). To facilitate further application and testing of each of the examined methodologies, an open-source repository containing boilerplate code that can be applied to different datasets is published alongside the paper, available at: github.com/dhopp1/nowcasting_benchmark.  ( 2 min )
    Scalable computation of prediction intervals for neural networks via matrix sketching. (arXiv:2205.03194v1 [stat.ML])
    Accounting for the uncertainty in the predictions of modern neural networks is a challenging and important task in many domains. Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure (e.g., Bayesian neural networks) or dramatically increase the computational cost of predictions, as in approaches based on ensembling. This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals. The method is based on the classical delta method in statistics but achieves computational efficiency by using matrix sketching to approximate the Jacobian matrix. The resulting algorithm is competitive with state-of-the-art approaches for constructing predictive intervals on various regression datasets from the UCI repository.  ( 2 min )
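    A sketch of the classical delta method the paper builds on, written exactly (no sketching) and therefore only feasible for small Jacobians; the paper's matrix-sketching approximation is omitted. Function and variable names are assumptions.

```python
# Classical delta-method prediction intervals from model Jacobians.
# The paper replaces jac_train with a much smaller random sketch; here it is exact.
import numpy as np

def delta_intervals(jac_train, jac_test, residuals, z=1.96):
    """jac_*: (n_points, n_params) Jacobians of predictions w.r.t. parameters."""
    sigma2 = np.mean(residuals ** 2)                  # noise variance estimate
    cov = sigma2 * np.linalg.pinv(jac_train.T @ jac_train)  # parameter covariance
    var = np.einsum("ip,pq,iq->i", jac_test, cov, jac_test) # per-point delta variance
    return z * np.sqrt(sigma2 + var)                  # interval half-widths
```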
    Differentially Private Generalized Linear Models Revisited. (arXiv:2205.03014v1 [cs.LG])
    We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.  ( 2 min )
    Active Offline Policy Selection. (arXiv:2106.10251v4 [cs.LG] UPDATED)
    This paper addresses the problem of policy selection in domains with abundant logged data, but with a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between the evaluation by OPE and full online evaluation. Yet, large amounts of online interaction are often not possible in practice. To overcome this problem, we introduce active offline policy selection - a novel sequential decision approach that combines logged data with online interaction to identify the best policy. We use OPE estimates to warm-start the online evaluation. Then, in order to utilize the limited environment interactions wisely, we decide which policy to evaluate next based on a Bayesian optimization method with a kernel that represents policy similarity. We use multiple benchmarks, including real-world robotics, with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.  ( 2 min )
    Solar: $L_0$ solution path averaging for fast and accurate variable selection in high-dimensional data. (arXiv:2007.15707v3 [stat.ML] UPDATED)
    We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar reconstructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving variable selection stability, efficiency, and accuracy. We prove that: (i) with a high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample size. We also demonstrate that: (i) solar yields, with less than $1/3$ of the lasso computation load, substantial improvements over lasso in terms of the sparsity (64-84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selection of redundant variables and rejection of informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserves residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) given the computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than the theoretical maximum speedup for parallelized bootstrap lasso (confirmed by Amdahl's law).  ( 2 min )
    R-GCN: The R Could Stand for Random. (arXiv:2203.02424v2 [cs.LG] UPDATED)
    The inception of the Relational Graph Convolutional Network (R-GCN) marked a milestone in the Semantic Web domain as a widely cited method that generalises end-to-end hierarchical representation learning to Knowledge Graphs (KGs). R-GCNs generate representations for nodes of interest by repeatedly aggregating parameterised, relation-specific transformations of their neighbours. However, in this paper, we argue that the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned weights. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which leaves all parameters untrained and thus constructs node embeddings by aggregating randomly transformed random representations from neighbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings.  ( 2 min )
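    A minimal numpy sketch of one randomly-parameterised R-GCN layer, the core idea of RR-GCN as described in the abstract: per-relation mean aggregation through frozen random weights, with no training at all. Sizes and sparsity are arbitrary choices.

```python
# One untrained relational message-passing layer (RR-GCN-style sketch).
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, n_rel = 100, 32, 4
H = rng.normal(size=(n_nodes, dim))                   # random initial node features
A = [(rng.random((n_nodes, n_nodes)) < 0.05).astype(float) for _ in range(n_rel)]  # one adjacency per relation
W = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_rel)]             # frozen random weights

def rrgcn_layer(H):
    out = np.zeros_like(H)
    for r in range(n_rel):
        deg = np.maximum(A[r].sum(axis=1, keepdims=True), 1.0)  # normaliser per node
        out += (A[r] @ H) / deg @ W[r]                # mean-aggregate, then transform
    return np.tanh(out)                               # nonlinearity

emb = rrgcn_layer(rrgcn_layer(H))   # two rounds of untrained message passing
```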
    Belief propagation for permutations, rankings, and partial orders. (arXiv:2110.00513v2 [cs.AI] UPDATED)
    Many datasets give partial information about an ordering or ranking by indicating which team won a game, which item a user prefers, or who infected whom. We define a continuous spin system whose Gibbs distribution is the posterior distribution on permutations, given a probabilistic model of these interactions. Using the cavity method we derive a belief propagation algorithm that computes the marginal distribution of each node's position. In addition, the Bethe free energy lets us approximate the number of linear extensions of a partial order and perform model selection between competing probabilistic models, such as the Bradley-Terry-Luce model of noisy comparisons and its cousins.  ( 2 min )
    What Makes A Good Fisherman? Linear Regression under Self-Selection Bias. (arXiv:2205.03246v1 [math.ST])
    In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.  ( 2 min )
    Optimally tackling covariate shift in RKHS-based nonparametric regression. (arXiv:2205.02986v1 [math.ST])
    We study the covariate shift problem in the context of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We focus on two natural families of covariate shift problems defined using the likelihood ratios between the source and target distributions. When the likelihood ratios are uniformly bounded, we prove that the kernel ridge regression (KRR) estimator with a carefully chosen regularization parameter is minimax rate-optimal (up to a log factor) for a large family of RKHSs with regular kernel eigenvalues. Interestingly, KRR does not require full knowledge of the likelihood ratio apart from an upper bound on it. In striking contrast to the standard statistical setting without covariate shift, we also demonstrate that a naïve estimator, which minimizes the empirical risk over the function class, is strictly suboptimal under covariate shift as compared to KRR. We then address the larger class of covariate shift problems where the likelihood ratio is possibly unbounded yet has a finite second moment. Here, we show via careful simulations that KRR fails to attain the optimal rate. Instead, we propose a reweighted KRR estimator that weights samples based on a careful truncation of the likelihood ratios. Again, we are able to show that this estimator is minimax optimal, up to logarithmic factors.  ( 2 min )
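    A sketch of kernel ridge regression and a likelihood-ratio-reweighted variant in the spirit of the abstract. The simple clipping used here is a stand-in assumption for the paper's careful truncation rule.

```python
# KRR and an importance-weighted variant (illustrative, not the paper's exact rule).
import numpy as np

def krr_fit(K, y, lam, weights=None):
    """Solve (W K + n*lam*I) alpha = W y; weights=None gives plain KRR."""
    n = len(y)
    W = np.eye(n) if weights is None else np.diag(weights)
    return np.linalg.solve(W @ K + n * lam * np.eye(n), W @ y)

def krr_predict(K_test_train, alpha):
    return K_test_train @ alpha

def reweighted_alpha(K, y, lam, ratios, clip=20.0):
    # clip the estimated likelihood ratios before reweighting the samples
    return krr_fit(K, y, lam, weights=np.minimum(ratios, clip))
```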
    Lagrangian PINNs: A causality-conforming solution to failure modes of physics-informed neural networks. (arXiv:2205.02902v1 [cs.LG])
    Physics-informed neural networks (PINNs) leverage neural networks to find the solutions of partial differential equation (PDE)-constrained optimization problems with initial conditions and boundary conditions as soft constraints. These soft constraints are often considered to be the sources of the complexity in the training phase of PINNs. Here, we demonstrate that the challenge of training (i) persists even when the boundary conditions are strictly enforced, and (ii) is closely related to the Kolmogorov n-width associated with problems demonstrating transport, convection, traveling waves, or moving fronts. Given this realization, we describe the mechanism underlying training schemes such as those used in eXtended PINNs (XPINN), curriculum regularization, and sequence-to-sequence learning. For an important category of PDEs, i.e., those governed by the non-linear convection-diffusion equation, we propose reformulating PINNs on a Lagrangian frame of reference, i.e., LPINNs, as a PDE-informed solution. A parallel architecture with two branches is proposed: one branch solves for the state variables on the characteristics, and the second branch solves for the low-dimensional characteristic curves themselves. The proposed architecture conforms to the causality innate to the convection, and leverages the direction of travel of the information in the domain. Finally, we demonstrate that the loss landscapes of LPINNs are less sensitive to the so-called "complexity" of the problems, compared to those of traditional PINNs in the Eulerian framework.  ( 2 min )
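    For readers unfamiliar with the Lagrangian reformulation, the change of frame that motivates LPINNs can be illustrated on the inviscid limit of the convection equation; this is the textbook method of characteristics, not necessarily the paper's exact formulation:

```latex
% Eulerian-to-Lagrangian change of frame (inviscid convection, for illustration).
\begin{align*}
  &\text{Eulerian frame:} &
  \frac{\partial u}{\partial t} + c(u)\,\frac{\partial u}{\partial x} &= 0,\\
  &\text{characteristic curves:} &
  \frac{\mathrm{d}x}{\mathrm{d}t} &= c(u),\\
  &\text{Lagrangian frame:} &
  \frac{\mathrm{d}u}{\mathrm{d}t}\Big|_{x(t)} &= 0.
\end{align*}
% One network branch learns u along the characteristics, the other learns the
% low-dimensional curves x(t) themselves, matching the two-branch architecture.
```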
    Explainable multi-class anomaly detection on functional data. (arXiv:2205.02935v1 [stat.ML])
    In this paper we describe an approach for anomaly detection, and its explainability, in multivariate functional data. The anomaly detection procedure consists of transforming the series into a vector of features and using an Isolation Forest algorithm. The explainability procedure is based on the computation of SHAP coefficients and on the use of a supervised decision tree. We apply the approach to simulated data to measure the performance of our method, and to real data from industry.  ( 2 min )
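    A sketch of the detection half of this pipeline (the SHAP-based explainability step is omitted): summarise each curve by a feature vector, then fit an Isolation Forest. The three features below are illustrative choices, not necessarily the paper's.

```python
# Functional anomaly detection via feature extraction + Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

def featurize(curves):
    """curves: (n_samples, n_timepoints) array of discretised functions."""
    slope = curves[:, -1] - curves[:, 0]               # crude trend feature
    return np.column_stack([curves.mean(axis=1), curves.std(axis=1), slope])

rng = np.random.default_rng(0)
curves = np.cumsum(rng.normal(size=(300, 100)), axis=1)  # random-walk curves
curves[:5] += 25                                          # inject 5 anomalies

iso = IsolationForest(random_state=0).fit(featurize(curves))
labels = iso.predict(featurize(curves))   # -1 flags anomalies
print(np.where(labels == -1)[0][:10])
```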

  • Open

    Hello, are there any suggestions for scalable multi-agent RL environments (I want to be able to change the number of agents in the env)? Thank you
    submitted by /u/Ok_Lab_2750 [link] [comments]  ( 1 min )
    "BARL: An Experimental Design Perspective on Model-Based Reinforcement Learning" (on Mehta et al 2021)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    dialogue history in dialogue generation
    What is the "dialogue history" in dialogue generation, as used in section 4.3 of the paper titled "Deep Reinforcement Learning for Dialogue Generation"? https://arxiv.org/abs/1606.01541 submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    I’ve published a new repo with A collection of Deep RL algorithms implemented with PyTorch
    submitted by /u/Top_Serve_2348 [link] [comments]
    Will training in multi-agent reinforcement learning converge? Assume there are two agents: "A gets stronger, B learns from errors, B gets stronger, A learns from errors, and so on". Will this happen?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    reinforcement learning in finance and trading
    How is it different from time-series analysis? And since there is an enormous amount of data to analyze, is it worth using RL in this field? If so, please link a GitHub repository containing RL papers on finance and trading. submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    Question about Vectorized Environments and GPU/Cuda training
    I have a bunch of questions that I can't seem to find any good answers to, so I might as well ask them here. I'm learning stable-baselines3 by creating a snake game. Pretty simple. I've enabled CUDA training so that I can use my 1080 Ti. When I vectorize my custom environment, I do see some efficiency gains in training, but not many (using the PPO model). For 1,000,000-timestep training: n_envs = 4 --> 20 minutes; n_envs = 12 --> 16 minutes; n_envs = 100 --> 13 minutes. The time efficiency does not scale linearly with the number of environments running concurrently. So I'm guessing that either I'm doing something wrong or the efficiency gains aren't that impressive in general. It seems that the 1080 Ti has somewhere around 3000 CUDA cores. Does that mean I can run 3000 concurrent environments? Also... after 1 million timesteps the snake AI is still pretty dumb. Is this normal, or should I see relatively good results after 1 million timesteps? And if I don't, is that an indication that my reward system or something else isn't good? Thanks! submitted by /u/HaikuHaiku [link] [comments]  ( 1 min )
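    For reference, a sketch of vectorised SB3 training as I understand the library's API (treat the exact imports as assumptions to verify against the docs). Note that environments step in CPU processes; CUDA cores are not a budget for n_envs, which is one reason wall-clock time does not scale linearly with it.

```python
# Vectorised PPO training with stable-baselines3 (API as I recall it).
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# SubprocVecEnv gives one OS process per env (DummyVecEnv steps them serially)
env = make_vec_env("CartPole-v1", n_envs=12, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, device="cuda", verbose=1)  # GPU runs the network only
model.learn(total_timesteps=1_000_000)
```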
  • Open

    Top 10 Leading Universities in AI Research
    Deep learning, natural language processing, data analytics, and big-data mining are fields of Artificial Intelligence (AI), and many companies are looking for professionals in these fields. A professional degree in AI from a reputed university will help you get started in this industry. Read more submitted by /u/ridamughal110 [link] [comments]
    Is AI cost-effective?
    Artificial Intelligence (AI) has the potential to disrupt the global economy by bringing significant changes to the industry. Corporations are recognizing AI's competitive edge, and as a result, a growing number of companies are implementing AI technology and reaping significant benefits. Read more submitted by /u/ridamughal110 [link] [comments]
    Overview of over 160 Biases (Belief, decision-making & behavioral, Social, Memory)
    Cognitive biases are a serious issue, especially for people who deal with data (and algorithms). I've compiled a list (PDF/EPUB) of over 160 biases (mainly from Wikipedia); maybe it is useful to some. These biases affect belief formation, reasoning processes, business & economic decisions, and human behavior in general. Let's learn more about our human biases to draw less biased conclusions in the future. The PDF/EPUB can be downloaded for free on Leanpub: Cognitive Biases: A Brief Overview of Over 160 Cognitive Biases submitted by /u/Philo167 [link] [comments]  ( 1 min )
    Poker AI Plays Itself, Match 3
    submitted by /u/bluboxsw [link] [comments]
    Alibaba Introduces ‘FederatedScope’: An Easy-To-Use Federated Learning Platform Providing Comprehensive Functionalities
    As the name suggests, federated learning is a machine learning technique that trains a model over several dispersed nodes or hosts. Each node utilizes its own training data. If the model parameters are shared between nodes rather than the raw data, the data can be kept private. Due to privacy concerns, obtaining training data to design and evolve machine learning models is increasingly being questioned, and federated learning can help alleviate some of these issues. The Chinese e-commerce behemoth, Alibaba, has created a federated learning platform that allows machine learning algorithms to be constructed without sharing training data. Continue Reading Github: https://github.com/alibaba/FederatedScope submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Uniting human brains and computers: A new type of AI
    submitted by /u/bendee983 [link] [comments]
    The ML pipeline & Types of roles in ML
    submitted by /u/mr-minion [link] [comments]
    Presentation of Project: The Ark Evolving
    I am a programmer of intermediate knowledge, and this is a project in development by me alone. It is based on creating and executing an artificial intelligence to play the Magic Arena card game in an advanced way; the AI will have supervised learning using the minimax algorithm. At the moment I have the concept, and that is why I am making this post: to ask for any advice and recommendations you can give me. submitted by /u/Merelex_Buk-12 [link] [comments]  ( 1 min )
    Iterative to Launch Open Source Tool, First to Train Machine Learning Models on Any Cloud Using HashiCorp’s Terraform Solution
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Apple loses Ian Goodfellow, director of machine learning, inventor of GANs, over return-to-office policy
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    Watch this model explain, debug, document + generate test cases and answer questions about your code!
    submitted by /u/landongarrison [link] [comments]  ( 1 min )
  • Open

    [D] Inequality of the paper [On the uncertainty principle of neural networks]
    I'm reading the paper [On the uncertainty principle of neural networks](https://arxiv.org/abs/2205.01493), and I'm doubting one inequality. Between equations (7) and (8), there's a part that states "If we use the property..." https://preview.redd.it/tz6k5htt0cy81.png?width=774&format=png&auto=webp&s=99fc8b5701e00fdf253e2f52eb2fde240772fe5e Is the yellow inequality true? I think the inequality should be the opposite, due to the arithmetic-geometric mean inequality. Thanks. submitted by /u/jryoungw2035 [link] [comments]  ( 1 min )
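    Without the screenshot the disputed step cannot be checked here, but the inequality the poster appeals to does run in the direction they describe. For any real a, b:

```latex
% The arithmetic-geometric mean inequality invoked in the post:
\[
  ab \;\le\; \frac{a^{2} + b^{2}}{2},
\]
% with equality iff a = b; so a bound of the form a^2 + b^2 <= 2ab
% can only hold in that degenerate case, supporting the poster's doubt.
```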
    [News] This “amateur” programmer once used 50 sheets of 1080Ti to fight cancer
    submitted by /u/coolwulf [link] [comments]  ( 1 min )
    [P] [python] Trading Bot! Our team is looking for 2 new team members! We have already done a lot! Info in text :)
    Hello, our team has four members based in Europe, aged 27-39. We are building a trading bot from scratch, based on a strategy related to momentum and fractal geometry. I manually traded that strategy for 3 or more years and made it unique. The live trader (the bot) is already built, but it requires more development. The backtrader is 90% complete. Results are promising, and that is what keeps us pushing the project. All team members share the source code as our reward. We have decided to acquire 1 or 2 team members. Please feel free to approach me. Thanks! submitted by /u/Prior_Regret_4138 [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 137
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight; otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding, and please don't post things which are present in the wiki. Preferably you should link the arXiv page (not the PDF; you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks: [index of links to Weeks 1-136, omitted]. Most upvoted papers two weeks ago: /u/CatalyzeX_code_bot: Paper link. Besides that, there are no rules, have fun. submitted by /u/ML_WAYR_bot [link] [comments]  ( 1 min )
    [D] Is WGAN-GP gradient penalty applicable to the generator?
    I am currently studying a set of papers on GAN-driven video-to-speech synthesis, and one of them in particular has got me scratching my head in confusion. The study describes an architecture based on WGAN-GP, a modification of WGAN which aims to enforce the 1-Lipschitz constraint by means of a gradient penalty. Specifically, the original paper presents an algorithm in which said penalty is applied to the critic, keeping the generator's loss function relatively simple. In this paper, however, the loss functions specified show the penalty applied to the generator instead, with the critic's loss function being identical to that of a standard WGAN. They do, however, state right below that "The coefficient for the gradient penalty λ is kept at the value of 10 for both critics..." https://preview.redd.it/gdp35hihsay81.png?width=793&format=png&auto=webp&s=7eedf0103ee3b33afe991148ac5a7552d262f35f So the question is: which is it? Should I consider this to be a typo of some sort, or am I not understanding something in this paper? I can understand how keeping the function's gradient norm below 1 may help when applied to the critic, but what benefit could it have for the generator? Are the other loss functions used in the paper somehow relevant, perhaps? submitted by /u/ShujiMikami [link] [comments]  ( 1 min )
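    For reference, the standard WGAN-GP formulation (Gulrajani et al., 2017) penalises the critic's gradient norm at random interpolates between real and fake samples; whether the paper under discussion intentionally moves it to the generator, I cannot say. A minimal PyTorch sketch:

```python
# Standard WGAN-GP gradient penalty, applied to the critic.
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # one random mixing coefficient per sample, broadcast over remaining dims
    eps = torch.rand(real.size(0), *[1] * (real.dim() - 1), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    norm = grad.flatten(1).norm(2, dim=1)       # per-sample gradient norm
    return lam * ((norm - 1.0) ** 2).mean()     # push norms toward 1
```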
    [D] Looking for a python library that implements decision tree regressors handling categorical features
    Hello community, I am a bit baffled that scikit-learn does not support this. I am looking for a good Python library that enables fitting a decision tree regressor on both numerical and categorical features (a non-binary tree). Could you point me to one if you know any, please? Thanks! (PS: This is for visualization and interpretability, so things like CatBoost won't do.) EDIT: I think I may have found what I was looking for: chefboost (unfortunately it is a bit simplistic and does not support pruning atm). Also, xMattC3 pointed to this ongoing PR for scikit-learn: NOCATS. submitted by /u/yannbouteiller [link] [comments]  ( 2 min )
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from OpenAI + Apple’s version of MobileNet, and more, directly on your phone's camera roll.
    submitted by /u/Playgroundai [link] [comments]  ( 1 min )
    [N] We're sharing details on new methods to more efficiently train Vision Transformers, a model architecture that can achieve state-of-the-art results on a wide range of computer vision tasks.
    submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [P] Comparison of NLP Sentiment Analysis ML models - VADER vs roBERTa
    submitted by /u/robikscuber [link] [comments]
    [N] Deepmind's Flamingo combines speech and vision
    submitted by /u/much_successes [link] [comments]
    [D] Dynamic Time Warping to align sequence to part of a longer sequence
    I have a dataset of audio files, and I wish to create search functionality in the audio space: given a short fragment containing only one spoken word, I want to find the exact location of that word in a longer recording containing multiple words. Let's say I want to avoid audio transcription for some reason. So I want to find the exact location within the audio without knowing which word is used, purely based on the waveform. I could try doing this with a sliding window, but I thought Dynamic Time Warping (DTW) would be an interesting alternative. However, DTW requires that the start and end points of both sequences be aligned. This isn't the case here, as the query sequence is a subset of the longer sequence (or may not even be present in the longer sequence). Also, finding the start and end points in the longer sequence is kind of the goal, so it's not useful as a prior. Are there any algorithms similar to DTW or any approaches I should know about? submitted by /u/RaptorDotCpp [link] [comments]  ( 3 min )
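    One relevant relaxation is subsequence (open-begin/open-end) DTW, which lets the query start and end anywhere in the long sequence; a minimal O(n*m) sketch on scalar sequences follows (for audio one would run it on frame-level features such as MFCCs with a vector distance):

```python
# Subsequence DTW: the query may match any contiguous stretch of `long`.
import numpy as np

def subsequence_dtw(query, long):
    n, m = len(query), len(long)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                                    # free starting point in `long`
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - long[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[n, 1:])) + 1               # free end point
    return D[n, end], end                            # match cost, end index in `long`

# toy example: find a noisy bump inside a longer signal
long = np.concatenate([np.zeros(50), np.sin(np.linspace(0, np.pi, 30)), np.zeros(50)])
print(subsequence_dtw(np.sin(np.linspace(0, np.pi, 25)), long))
```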
    [D] AWS SageMaker vs. similar tools that can run on any cloud
    Hi all, I've been a user of AWS SageMaker for quite some time. Now we have some other cloud credits, and I want a multi-cloud solution that can help me run training jobs. I'm planning on this approach: have some multi-cloud object storage and save all my data there (will MinIO help?); use Terraform to create configuration files to create nodes on any cloud (how difficult will this be?); write a simple configuration file that takes the cloud provider, instance types, training Docker image, and storage path, and runs a training job. Is this implemented already? I am already exploring some serverless training options. Any thoughts on this? Anyone with the same problem who would like to contribute? submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [P] Optimizing ML Workloads with TPI (Terraform Provider Iterative) - One Provider to Rule AWS, Azure, GCP, K8s
    The guide introduces Terraform Provider Iterative (TPI), an open-source tool extending the functionality of Terraform. The tool enables full lifecycle management of computing resources and is designed specifically for ML pipelines: Machine Learning Workloads with Terraform Provider Iterative. It was designed for machine learning (ML/AI) teams and optimizes CPU/GPU expenses: TPI unifies auto-scaling groups for all the major cloud providers (AWS, Azure, GCP, and Kubernetes); provides spot-instance auto-recovery (if an instance was evicted/terminated) with data and checkpoint synchronization; and auto-terminates instances when ML training is finished (so you won't forget to terminate your expensive GPU instance for a week :)), all using Terraform commands and config (HCL). submitted by /u/cmstrump [link] [comments]  ( 1 min )
    [D] Tips for reviewing at top A* conferences
    I'll be reviewing papers at a top conference for the first time this year, so I was wondering if you could share some tips on how to be a good reviewer, and how you typically review a submission at any of the top A* conferences. submitted by /u/furious_madman_123 [link] [comments]  ( 1 min )
    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    submitted by /u/hardmaru [link] [comments]  ( 4 min )
    [D] How much time do you spend transforming data before training a model?
    I am looking for straightforward ways, if they exist, to check for data transformation opportunities for different datasets. submitted by /u/sphinx00777 [link] [comments]
  • Open

    Need help in understanding the role of activation functions.
    This question is being asked here after a disagreement I had with a professor of mine. As far as I understood it, activation functions are used to make neural networks non-linear, but he believes they activate and deactivate individual nodes of a NN. I can't get my head around how an activation function activates a neuron without using a threshold function. Taking the case of a neuron with a sigmoid activation, the output will never be 0; it is minuscule when x is a large negative number, but still not 0. I tried to find the answer online but found sites with differing opinions. submitted by /u/Pritesh190801 [link] [comments]  ( 2 min )
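    The poster's observation, made concrete in a few lines: a sigmoid unit is never exactly "off" (output 0), while a ReLU unit is, which is the sense in which ReLU networks genuinely deactivate individual nodes.

```python
# Sigmoid outputs are tiny but never zero; ReLU outputs are exact zeros.
import numpy as np

x = np.array([-100.0, -10.0, 0.0, 10.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))
relu = np.maximum(x, 0.0)

print(sigmoid)   # ~[3.7e-44 4.5e-05 5.0e-01 1.0e+00] -- tiny, but never 0
print(relu)      # [ 0.  0.  0. 10.] -- exact zeros: the unit is truly off
```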
    The ML pipeline & Types of roles in ML
    submitted by /u/mr-minion [link] [comments]
    Defeating Amazon Rekognition's face detection
    Amazon's IT department, also known as Amazon Web Services, or AWS, developed a computer vision platform called Rekognition. https://aws.amazon.com/rekognition/ Rekognition is cloud-based software as a service that does image and video analysis. This system does a lot; however, my focus is on face recognition of people (not only users of the system, but anyone). AWS provides cloud services, which means they have an effectively unlimited amount of space on their servers. There are claims, which I couldn't confirm (so far), that Amazon obtains data about people not only from social media sources, but also from sources that are officially not public. If the data is saved on AWS cloud services, Amazon practically has this data. Amazon takes pride in a long list of customers https://aws.amazon.com/rekognition/cu…  ( 2 min )
  • Open

    Network security guidelines for 2022
    Cyber security has become a buzzword in a tech-savvy world that hinges on digital platforms. It has become very critical for individuals to become cyber aware, and people are thereby developing security tips to help individuals manage their digital presence. Whether for personal use or professional undertaking, cyber security has become a significant… Read More »Network security guidelines for 2022 The post Network security guidelines for 2022 appeared first on Data Science Central.  ( 3 min )
    Understanding EXIF Data and How to View It on Android Phones
    Most modern-day smartphones with great cameras make it really easy for anyone and everyone to click good photos. But what most people don't realize is that with each photograph, they're also capturing a lot of personal information which, when they share that photo on social media, becomes available to a much wider set of people on the… Read More »Understanding EXIF Data and How to View It on Android Phones The post Understanding EXIF Data and How to View It on Android Phones appeared first on Data Science Central.  ( 3 min )
    AI In Manufacturing: Know How Latest Intelligence Reshaping the Industries with Speed and Accuracy
    Over the last few years, the latest industrial revolution has been the most significant evolution ever faced by the industrial sector. It encompasses all the latest technology trends affecting industries all over the world: autonomous cars, smart connected devices, sensors, computer chips, and many other technologies represent this transformation. This is happening because the manufacturing industry has been… Read More »AI In Manufacturing: Know How Latest Intelligence Reshaping the Industries with Speed and Accuracy The post AI In Manufacturing: Know How Latest Intelligence Reshaping the Industries with Speed and Accuracy appeared first on Data Science Central.  ( 3 min )
    Social Media, Cyber Bullying, and Need For Content Moderation
    A rise in netizens and social media platforms, coupled with the growth of the mobile internet, has resulted in a jump in the creation and consumption of User Generated Content (UGC). Social media platforms, in all honesty, have become a major channel for disseminating, circulating, and exchanging information to billions of people on the internet… Read More »Social Media, Cyber Bullying, and Need For Content Moderation The post Social Media, Cyber Bullying, and Need For Content Moderation appeared first on Data Science Central.  ( 6 min )
    Comparing Cloud Telephony With On-Premise Call Centers: An Upgrade
    Contact center dynamics are changing as well as the customer engagement landscape. Due to the growing customer expectations and time constraints, cloud call center solutions are more important than ever before. With technology, businesses and agents can offer omnichannel customer service, and customers can handle most queries themselves. In order to achieve high omnichannel engagement,… Read More »Comparing Cloud Telephony With On-Premise Call Centers: An Upgrade The post Comparing Cloud Telephony With On-Premise Call Centers: An Upgrade appeared first on Data Science Central.  ( 4 min )
    The long game: Desiloed systems and feedback loops by design (I of II)
    Data management (DM) discussions can be frustrating because both those feeling the pain and the consultants who try to help them are–90+ percent of the time, it seems–still using the same old ways. Those ways only go so far, and won’t go any farther. That’s because those who reinforce the old ways assume that what… Read More »The long game: Desiloed systems and feedback loops by design (I of II) The post The long game: Desiloed systems and feedback loops by design (I of II) appeared first on Data Science Central.  ( 5 min )
    The Graph of Thrones: the Now Graph and Eternal Graph in RDF-Star Modeling
    Warning: This is going to get heavy into Turtle code, but I think there’s enough here for it to be worth reading if you are involved in knowledge graph work. I’ve been working with knowledge graphs a lot lately, and a conversation that I had with a few other ontologists has been resonating in my… Read More »The Graph of Thrones: the Now Graph and Eternal Graph in RDF-Star Modeling The post The Graph of Thrones: the Now Graph and Eternal Graph in RDF-Star Modeling appeared first on Data Science Central.  ( 9 min )
  • Open

    Driving Experimentation Forward through a Working Group (Experimentation Program Series: Guide 03)
    In my previous post, I defined an experimentation program (ExPr) as the mechanism by which a company uses randomized controlled experiments to generate positive business results. An ExPr is composed of the people, processes, and infrastructure for running experiments at… Read More The post Driving Experimentation Forward through a Working Group (Experimentation Program Series: Guide 03) appeared first on ML in Production.  ( 7 min )
  • Open

    Any interesting “less obvious” real-world applications of RL
    Reinforcement learning has immediate applications in industrial robotics and other control-oriented tasks. Are there any interesting real-world applications of RL that are less obvious than robotics or trading? I saw one application in the crypto space and am curious about other possible applications (in any sector) of RL. submitted by /u/blitzkreig3 [link] [comments]  ( 1 min )
    No module named 'gym.envs.classic_control.rendering'. How do I fix it?
    I ran into it when running retro.examples.random_agent. submitted by /u/sirThomasmacAbyss [link] [comments]  ( 1 min )
    Reasonable training result, but how to improve further?
    Hi all, I have a 4-DOF robot. I am trying to teach it this specific movement: "whenever you move, don't move joint 1 (orange in the plot) at the same time as joints 2, 3, 4". The corresponding reward function is: reward = 1 / (abs(torque_q1) * max(abs(torque_q2), abs(torque_q3), abs(torque_q4))). As the plot shows, the learned policy somehow reproduces the intended movement: first the q1 movement, then the other joints. But the part that I want to improve is around t=13: there q1 gradually decreases and the other joints gradually start to move. Is there a way to improve this so that there is a complete stop of the q1 movement before the other joints start to move? https://preview.redd.it/do8gaqyhm3y81.png?width=2892&format=png&auto=webp&s=2bbc13523e9043de7a6316f3edd2e2741e910cd0 submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
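    A hedged suggestion, based only on the formula as posted (an assumption about the setup, not a verified fix): reward = 1/(|t1| * max(|t2|,|t3|,|t4|)) blows up whenever the product is merely small, so it rewards small overlap rather than no overlap. A bounded penalty on the product makes a complete stop the optimum:

```python
# Assumed alternative reward shaping -- bounded in [0, 1], maximal only when
# joint 1 and the other joints never produce torque at the same time.
import numpy as np

def reward(torques, k=10.0):
    t1, rest = abs(torques[0]), np.max(np.abs(torques[1:]))
    return float(np.exp(-k * t1 * rest))   # 1.0 iff q1 and the rest don't overlap

print(reward(np.array([0.0, 0.5, 0.1, 0.2])))  # 1.0: q1 fully stopped
print(reward(np.array([0.3, 0.5, 0.1, 0.2])))  # < 1.0: overlap penalised
```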
    Is it possible to train a reinforcement learning agent using a dataset (in order to make use of human expertise in a game, for example) to reinforce the training (like a combination of reinforcement learning and supervised learning)?
    submitted by /u/Ok_Lab_2750 [link] [comments]  ( 2 min )
    Text summarization using RL
    Is it worth working on text summarization even though big tech companies are already working on this problem and have already achieved good results? Can someone suggest topics that can be explored using RL techniques in the NLP domain? submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    Can we use a stochastic policy in DDPG framework?
    In DDPG we use a deterministic policy, i.e., s → a: each state maps to a single deterministic action. But can we use a stochastic policy instead, where actions are sampled from a distribution? I think this might encourage exploration and thus improve performance. What do you think? submitted by /u/Better-Ad8608 [link] [comments]  ( 2 min )
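    For contrast, a toy sketch of the two policy types in question (networks and shapes assumed); sampling actions from a state-dependent Gaussian is essentially what stochastic-policy methods such as SAC do:

        import torch

        def deterministic_action(mu_net, state):
            return mu_net(state)  # DDPG-style: a = mu(s)

        def stochastic_action(mu_net, log_std, state):
            # SAC-style: a ~ N(mu(s), sigma), reparameterized so it stays differentiable
            dist = torch.distributions.Normal(mu_net(state), log_std.exp())
            return dist.rsample()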
    Application of reinforcement learning in NLP
    Could someone give an overview of how RL is being used in NLP? I'm looking for material on reinforcement learning in NLP, including papers and other sources. submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    Does anyone have experience with Isaac Gym?
    Hi all, has anyone tried to use Isaac Gym for a custom robot/algorithm? In the example scripts, they use def pre_physics_step(self, actions): to apply the actions to a robot that is a child class of BaseTask. Unfortunately, I cannot modify how these actions are created, as the script for BaseTask is not open-sourced. Has anyone managed to modify the value of actions for custom usage? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
  • Open

    [P] Markov chain with unequal sequence lengths
    I'm trying to build a simple Markov chain. I have data from therapy notes, where the therapist selects the overall topic of the session out of a list of 27 possible topics. The problem is that not every therapist tags their session topic consistently, and all clients have a different number of sessions. I'm trying to build a simple Markov chain to get the probabilities of topic transitions between sessions, but as you can see, the data are complicated. I was wondering if anyone has encountered a situation with uneven sequence lengths per observation/case, and how you went about building a Markov chain in that case? Thanks! submitted by /u/sebelly [link] [comments]  ( 1 min )
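    Unequal sequence lengths are not an obstacle for a first-order chain, since transition counts can simply be pooled across all sequences. A minimal sketch, assuming each client's sessions are encoded as a list of integer topic indices:

        import numpy as np

        N_TOPICS = 27

        def transition_matrix(sequences, n_states=N_TOPICS, smoothing=1.0):
            """Pool transition counts over sequences of any length >= 2."""
            counts = np.full((n_states, n_states), smoothing)  # additive smoothing
            for seq in sequences:
                for a, b in zip(seq[:-1], seq[1:]):  # consecutive sessions
                    counts[a, b] += 1
            return counts / counts.sum(axis=1, keepdims=True)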
    WACV 2023? [R]
    I don't see a website or any info for WACV 2023. Do you think it will still be hosted, and what would be the expected submission deadline for round 1? submitted by /u/avd4292 [link] [comments]
    [D] The state of reverse summarization/highly conditioned text gen
    Hi all, I’m looking to do a bit of research into highly conditioned text generation, specifically reverse summarization tasks. This type of task basically entails the reverse of what models like Pegasus aim to accomplish; instead of summarizing long text, the goal is to generate the long text from a summary. I’m curious if there’s been any work in this before or if there is an existing state of the art? It’s impossible to search for this online since summarization is so popular that everything I can find just entails going from long text to summary. I’d especially be curious about attempts at this task on conversational data (e.g. the reverse of SAMSum) or news articles (the reverse of CNN/DailyMail). submitted by /u/woodworksio [link] [comments]  ( 1 min )
    Good dataset for lasso regression [P]
    Hello everyone, I'm currently writing a thesis on the accelerated proximal point algorithm, and I need to demonstrate the efficiency of my accelerated algorithm with a lasso regression. The thing is, I need a dataset with a lot of features (to show the efficiency of lasso), say between 1,000 and 2,000. Do you know where I can download such a dataset for lasso regression? That would be a huge help. Thank you!!! submitted by /u/Jeffrosslostson [link] [comments]  ( 1 min )
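    If a suitable real dataset proves hard to find, one assumed fallback is a synthetic sparse regression problem in the requested feature range:

        from sklearn.datasets import make_regression
        from sklearn.linear_model import Lasso

        # ~1500 features with only 20 informative: the regime where
        # lasso's feature selection is easy to demonstrate.
        X, y = make_regression(n_samples=500, n_features=1500, n_informative=20,
                               noise=5.0, random_state=0)
        model = Lasso(alpha=1.0).fit(X, y)
        print((model.coef_ != 0).sum(), "features kept out of", X.shape[1])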
    [D] An End-to-End MLOps Platform Implementation using Open-source Tooling
    submitted by /u/ponderinghydrogen [link] [comments]
    [R] Fix the Noise: Disentangling Source Feature for Transfer Learning of StyleGAN (CVPRW 2022)
    submitted by /u/Cautious-Ad1373 [link] [comments]
    [P] T-SNE to view and order your Spotify tracks
    submitted by /u/40wd [link] [comments]  ( 2 min )
    [P] I created a DALL·E Flow website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] What are the issues with using TMLE/G comp/Double Robust estimators to interpret ML models with marginal effects?
    So TMLE is a way to do causal inference using ML models. It is described in this book: https://tlverse.org/tlverse-handbook/tmle3.html Of course, the causality part comes from the domain assumptions and causal graph; without those, it's just regular statistical inference/estimation. Kevin Murphy's Prob ML 2, in Ch. 35, also describes the G-computation procedure as well as double robust estimation and obtaining uncertainty. Briefly, G-comp involves perturbing the variable of interest by a small epsilon in both directions, making predictions on both perturbed datasets, averaging the difference in predictions, and dividing by 2*eps. If the variable is categorical, then you just do this for each category. This amounts to Pearl's backdoor adjustment formula. If the causal assumptions are satisfied, this estimate is causal; otherwise it is just some marginal effect. People say that ML models struggle with causality and interpretability and are black boxes, but what is the issue with the above approach? Using G-comp and enough data, in theory I could just throw a black box at the problem and still obtain an interpretable average effect size for an exposure (x variable) of interest, and if my variable selection was done right, it is also causal. Furthermore, this approach avoids the parametric assumptions of traditional regression, which would invalidate the inference if not satisfied anyway. So why isn't this causal or marginal effect stuff used more? It seems too good to be true; it's possible to obtain a CI and p-value with these methods, yet they haven't seemed to pick up much outside some academic papers. Is the weakness that there is more to interpretability than just a CI/p-value/effect size? What are you looking for with it? submitted by /u/111llI0__-__0Ill111 [link] [comments]  ( 4 min )
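    A minimal sketch of the finite-difference step described above (model, data layout, and names assumed; this is the generic marginal-effect computation, not the full TMLE machinery):

        import numpy as np

        def average_marginal_effect(model, X, col, eps=1e-3):
            """Central difference of model.predict w.r.t. feature `col`, averaged over X."""
            X_up, X_dn = X.copy(), X.copy()
            X_up[:, col] += eps
            X_dn[:, col] -= eps
            return np.mean(model.predict(X_up) - model.predict(X_dn)) / (2 * eps)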
    [P] Does it make sense to train a model twice on separate datasets?
    TLDR: If I use transfer learning to train a model on one general face dataset, does it make sense to then train it again on a more specific set of faces (a set more similar to my test set), or will training on the second dataset just overwrite everything learnt from the more general dataset? I have the task of using a CNN for facial recognition: classifying faces into different classes of people, with each individual person being its own separate class. The training data I am given is very limited; I only have one image for each class. I have 100 classes (so I have 100 images in total, one image of each person). The approach I am using is transfer learning with the GoogLeNet architecture. However, instead of just training the GoogLeNet on the images of the …  ( 3 min )
    LSTM/CNN architectures for time series forecasting [Discussion]
    I've been doing data engineering for the last year and a half and have just started going back into ML. I have a time series classification problem with a mix of time-varying covariates and static covariates. In the past, I've tried different types of RNNs, CNNs, and even CNN-LSTMs. However, I haven't kept up with the latest literature, so I feel a bit behind on what the typical ordering of the architecture looks like. Looking over a bunch of different examples, I don't see a recurring pattern in any of them. For example, between CNN layers, some will do pooling + BN + ReLU + dropout, while others will skip dropout or do ReLU before BN. On top of that, I have seen almost no other forms of regularization being applied besides the ones above and early stopping. I realize this is a broad question, but in general, assuming the preprocessing phase is done, what is the standard architectural ordering people use for LSTMs and CNNs? I also want to confirm: is the standard approach for situations with different covariate types to run the time-varying covariates through LSTM/CNN layers, concatenate with the static covariates, and then run that through feed-forward layers? submitted by /u/Think-Culture-4740 [link] [comments]  ( 1 min )
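    A minimal PyTorch sketch of that last pattern, with all sizes illustrative:

        import torch
        import torch.nn as nn

        class MixedCovariateNet(nn.Module):
            def __init__(self, n_dyn, n_static, hidden=64, n_classes=2):
                super().__init__()
                self.lstm = nn.LSTM(n_dyn, hidden, batch_first=True)
                self.head = nn.Sequential(
                    nn.Linear(hidden + n_static, hidden), nn.ReLU(),
                    nn.Dropout(0.2), nn.Linear(hidden, n_classes))

            def forward(self, x_seq, x_static):
                # x_seq: (batch, time, n_dyn); x_static: (batch, n_static)
                _, (h, _) = self.lstm(x_seq)  # final hidden state
                return self.head(torch.cat([h[-1], x_static], dim=1))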
  • Open

    AI News | Breast Cancer Detection Beats Humans | AI Learns Concepts Across Video, Audio & Text | AI Music Hit Song Detection
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    From Data Center Modeling to Edge Deployments with Monitoring: Domino Enterprise MLOps for Oil & Gas Machine Learning
    submitted by /u/Dracutela [link] [comments]
    9 Best Artificial Intelligence books for beginners to experts to read in 2022
    submitted by /u/maneesh123456 [link] [comments]
    I created a website for accessing DALL·E Flow
    I created this Streamlit website for Jina AI's awesome DALL·E Flow project. What do you think? submitted by /u/tomd_96 [link] [comments]
    Python library for deploying machine learning models
    submitted by /u/Illustrious_Row_9971 [link] [comments]
  • Open

    Mentally calculating the day of the week
    In my previous post I mentioned John Conway’s Doomsday rule for calculating the day of the week for any date. This method starts off very simple, but gets more complicated when you actually use it. This post will present an alternative method that’s easier to use in practice and can be described more succinctly. Here’s […] Mentally calculating the day of the week first appeared on John D. Cook.  ( 3 min )
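    Not the post's mental method, just a standard-library cross-check for anyone practicing it:

        import datetime

        def weekday(year, month, day):
            return datetime.date(year, month, day).strftime("%A")

        print(weekday(2022, 5, 9))  # "Monday"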
  • Open

    Determine whether an image, from a given sequence of images, has changed considerably
    Problem Statement: You have a sequence of images for a room that contains four important components: a sink, piping, walls, and a fire alarm. An image is taken every week. That means you have an initial image taken on week 0 and an image taken on week i. The room is undergoing construction, and each week you have to determine whether the room has undergone sufficient change. This process must continue until the completion of work on the room. There is no restriction on the perspective from which the image is taken. So, for example, the image from week A could be taken while standing in the right doorway, while the image from week A+1 could be taken while standing in the left doorway. You have to determine whether the four important components (sink, piping, walls, …  ( 2 min )
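    One assumed baseline for this kind of viewpoint-robust comparison (not a full solution, and the threshold is a guess to tune): compare global embeddings from a pretrained backbone and flag weeks whose similarity drops below a cutoff.

        import torch, torchvision
        from torchvision import transforms

        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = torch.nn.Identity()  # keep the 512-d embedding
        backbone.eval()

        prep = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

        def changed(img_week_t, img_week_t1, threshold=0.85):  # inputs: PIL images
            with torch.no_grad():
                emb = [backbone(prep(im).unsqueeze(0)) for im in (img_week_t, img_week_t1)]
            return torch.nn.functional.cosine_similarity(*emb).item() < threshold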
    Decimal to Binary Converter with an MLPNN
    Is it possible to implement a simple multi-layer perceptron neural network that can convert decimal to unsigned binary? If it is possible, what activation functions would I need, and how deep and wide would my layers need to be? I've trained it on 14 examples with 100% accuracy for 5-bit values ranging from 0 to 30 with a variety of desired outputs, so lack of data is most likely not the problem. My current (and fairly arbitrary) layer structure is: 1 input node, 10 tanh nodes, 10 tanh nodes, 5 sigmoid nodes. submitted by /u/EpicJoeR [link] [comments]  ( 1 min )
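    A minimal sketch of that setup (layer sizes illustrative; whether it generalizes beyond the training range is exactly the open question):

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        values = np.arange(31)  # the 5-bit range 0..30
        X = values.reshape(-1, 1) / 30.0  # normalized decimal input
        Y = ((values[:, None] >> np.arange(4, -1, -1)) & 1).astype(float)  # MSB-first bits

        net = MLPRegressor(hidden_layer_sizes=(10, 10), activation="tanh",
                           max_iter=5000, random_state=0).fit(X, Y)
        print("bit accuracy:", (net.predict(X).round() == Y).mean())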
    is FANN good in 2022?
    I made a game a few years ago in C++ with FANN just to mess around. Now I'm revisiting the concept, but I want to do a better job. FANN is perfectly fast enough: on my PC it can run tens of thousands of small networks forward at 50 frames per second. But it is unstable and tends to crash every minute or so while doing that. Unfortunately, the library does its own memory management, so I don't really know how to stabilize it. If I want a neural net library that's fast, simple, and stable in C++, is FANN the best choice, or still even a good choice? submitted by /u/rambutang [link] [comments]  ( 1 min )
  • Open

    U-FNO -- An enhanced Fourier neural operator-based deep-learning model for multiphase flow. (arXiv:2109.03697v3 [physics.geo-ph] UPDATED)
    Numerical simulation of multiphase flow in porous media is essential for many geoscience applications. Machine learning models trained with numerical simulation data can provide a faster alternative to traditional simulators. Here we present U-FNO, a novel neural network architecture for solving multiphase flow problems with superior accuracy, speed, and data efficiency. U-FNO is designed based on the newly proposed Fourier neural operator (FNO), which has shown excellent performance in single-phase flows. We extend the FNO-based architecture to a highly complex CO2-water multiphase problem with wide ranges of permeability and porosity heterogeneity, anisotropy, reservoir conditions, injection configurations, flow rates, and multiphase flow properties. The U-FNO architecture is more accurate in gas saturation and pressure buildup predictions than the original FNO and a state-of-the-art convolutional neural network (CNN) benchmark. Meanwhile, it has superior data utilization efficiency, requiring only a third of the training data to achieve accuracy equivalent to the CNN's. U-FNO provides superior performance in highly heterogeneous geological formations and critically important applications such as gas saturation and pressure buildup "fronts" determination. The trained model can serve as a general-purpose alternative to routine numerical simulations of 2D-radial CO2 injection problems with significant speed-ups over traditional simulators.
    An Explanation of In-context Learning as Implicit Bayesian Inference. (arXiv:2111.02080v5 [cs.CL] UPDATED)
    Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.
    Dangling-Aware Entity Alignment with Mixed High-Order Proximities. (arXiv:2205.02406v1 [cs.CL])
    We study dangling-aware entity alignment in knowledge graphs (KGs), which is an underexplored but important problem. As different KGs are naturally constructed by different sets of entities, a KG commonly contains some dangling entities that cannot find counterparts in other KGs. Therefore, dangling-aware entity alignment is more realistic than the conventional entity alignment where prior studies simply ignore dangling entities. We propose a framework using mixed high-order proximities for dangling-aware entity alignment. Our framework utilizes both the local high-order proximity in a nearest neighbor subgraph and the global high-order proximity in an embedding space for both dangling detection and entity alignment. Extensive experiments with two evaluation settings show that our framework more precisely detects dangling entities, and better aligns matchable entities. Further investigations demonstrate that our framework can mitigate the hubness problem in dangling-aware entity alignment.
    Automated Imbalanced Classification via Layered Learning. (arXiv:2205.02553v1 [cs.LG])
    In this paper we address imbalanced binary classification (IBC) tasks. Applying resampling strategies to balance the class distribution of training instances is a common approach to tackle these problems. Many state-of-the-art methods find instances of interest close to the decision boundary to drive the resampling process. However, under-sampling the majority class may potentially lead to important information loss. Over-sampling also may increase the chance of overfitting by propagating the information contained in instances from the minority class. The main contribution of our work is a new method called ICLL for tackling IBC tasks which is not based on resampling training observations. Instead, ICLL follows a layered learning paradigm to model the data in two stages. In the first layer, ICLL learns to distinguish cases close to the decision boundary from cases which are clearly from the majority class, where this dichotomy is defined using a hierarchical clustering analysis. In the subsequent layer, we use instances close to the decision boundary and instances from the minority class to solve the original predictive task. A second contribution of our work is the automatic definition of the layers which comprise the layered learning strategy using a hierarchical clustering model. This is a relevant discovery as this process is usually performed manually according to domain knowledge. We carried out extensive experiments using 100 benchmark data sets. The results show that the proposed method leads to better performance relative to several state-of-the-art methods for IBC.
    Characterizing player's playing styles based on Player Vectors for each playing position in the Chinese Football Super League. (arXiv:2205.02731v1 [cs.LG])
    Characterizing playing style is important for football clubs for scouting, monitoring and match preparation. Previous studies considered a player's style as a combination of technical performances, failing to consider spatial information. Therefore, this study aimed to characterize the playing styles of each playing position in the Chinese Football Super League (CSL) matches, integrating a recently adopted Player Vectors framework. Data of 960 matches from 2016-2019 CSL were used. Match ratings, and ten types of match events with the corresponding coordinates for all the lineup players whose on-pitch time exceeded 45 minutes were extracted. Players were first clustered into 8 positions. A player vector was constructed for each player in each match based on the Player Vectors using Nonnegative Matrix Factorization (NMF). Another NMF process was run on the player vectors to extract different types of playing styles. The resulting player vectors discovered 18 different playing styles in the CSL. Six performance indicators of each style were investigated to observe their contributions. In general, the playing styles of forwards and midfielders are in line with football performance evolution trends, while the styles of defenders should be reconsidered. Multifunctional playing styles were also found in highly rated CSL players.
    A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning. (arXiv:2202.13403v2 [cs.CV] UPDATED)
    Large datasets as required for deep learning of lip reading do not exist in many languages. In this paper we present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which was processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models. We demonstrate learning from scratch and show that transfer learning from LRW to GLips and vice versa improves learning speed and performance, in particular for the validation set.
    RANG: A Residual-based Adaptive Node Generation Method for Physics-Informed Neural Networks. (arXiv:2205.01051v2 [cs.LG] UPDATED)
    Learning solutions of partial differential equations (PDEs) with Physics-Informed Neural Networks (PINNs) is an attractive alternative approach to traditional solvers due to its flexibility and ease of incorporating observed data. Despite the success of PINNs in accurately solving a wide variety of PDEs, the method still requires improvements in terms of computational efficiency. One possible improvement idea is to optimize the generation of training point sets. Residual-based adaptive sampling and quasi-uniform sampling approaches have each been applied to improve the training of PINNs. To benefit from both methods, we propose the Residual-based Adaptive Node Generation (RANG) approach for efficient training of PINNs, which is based on a variable density nodal distribution method for RBF-FD. The method is also enhanced by a memory mechanism to further improve training stability. We conduct experiments on three linear PDEs and three nonlinear PDEs with various node generation methods, through which the accuracy and efficiency of the proposed method compared to the predominant uniform sampling approach are verified numerically.  ( 2 min )
    Offline Vehicle Routing Problem with Online Bookings: A Novel Problem Formulation with Applications to Paratransit. (arXiv:2204.11992v2 [cs.AI] UPDATED)
    Vehicle routing problems (VRPs) can be divided into two major categories: offline VRPs, which consider a given set of trip requests to be served, and online VRPs, which consider requests as they arrive in real-time. Based on discussions with public transit agencies, we identify a real-world problem that is not addressed by existing formulations: booking trips with flexible pickup windows (e.g., 3 hours) in advance (e.g., the day before) and confirming tight pickup windows (e.g., 30 minutes) at the time of booking. Such a service model is often required in paratransit service settings, where passengers typically book trips for the next day over the phone. To address this gap between offline and online problems, we introduce a novel formulation, the offline vehicle routing problem with online bookings. This problem is very challenging computationally since it faces the complexity of considering large sets of requests -- similar to offline VRPs -- but must abide by strict constraints on running time -- similar to online VRPs. To solve this problem, we propose a novel computational approach, which combines an anytime algorithm with a learning-based policy for real-time decisions. Based on a paratransit dataset obtained from the public transit agency of Chattanooga, TN, we demonstrate that our novel formulation and computational approach lead to significantly better outcomes in this setting than existing algorithms.
    Subverting Fair Image Search with Generative Adversarial Perturbations. (arXiv:2205.02414v1 [cs.LG])
    In this work we explore the intersection of fairness and robustness in the context of ranking: when a ranking model has been calibrated to achieve some definition of fairness, is it possible for an external adversary to make the ranking model behave unfairly without having access to the model or training data? To investigate this question, we present a case study in which we develop and then attack a state-of-the-art, fairness-aware image search engine using images that have been maliciously modified using a Generative Adversarial Perturbation (GAP) model. These perturbations attempt to cause the fair re-ranking algorithm to unfairly boost the rank of images containing people from an adversary-selected subpopulation. We present results from extensive experiments demonstrating that our attacks can successfully confer significant unfair advantage to people from the majority class relative to fairly-ranked baseline search results. We demonstrate that our attacks are robust across a number of variables, that they have close to zero impact on the relevance of search results, and that they succeed under a strict threat model. Our findings highlight the danger of deploying fair machine learning algorithms in-the-wild when (1) the data necessary to achieve fairness may be adversarially manipulated, and (2) the models themselves are not robust against attacks.
    Tracking the risk of a deployed model and detecting harmful distribution shifts. (arXiv:2110.06177v4 [stat.ML] UPDATED)
    When deployed in the real world, machine learning models inevitably encounter changes in the data distribution, and certain -- but not all -- distribution shifts could result in significant performance degradation. In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially, making interventions by a human expert (or model retraining) unnecessary. While several works have developed tests for distribution shifts, these typically either use non-sequential methods, or detect arbitrary shifts (benign or harmful), or both. We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate. In this work, we design simple sequential tools for testing if the difference between source (training) and target (test) distributions leads to a significant increase in a risk function of interest, like accuracy or calibration. Recent advances in constructing time-uniform confidence sequences allow efficient aggregation of statistical evidence accumulated during the tracking process. The designed framework is applicable in settings where (some) true labels are revealed after the prediction is performed, or when batches of labels become available in a delayed fashion. We demonstrate the efficacy of the proposed framework through an extensive empirical study on a collection of simulated and real datasets.
    Efficient and Convergent Federated Learning. (arXiv:2205.01438v2 [cs.LG] UPDATED)
    Federated learning has shown its advances over the last few years but is facing many challenges, such as how algorithms save communication resources, how they reduce computational costs, and whether they converge. To address these issues, this paper proposes a new federated learning algorithm (FedGiA) that combines gradient descent and the inexact alternating direction method of multipliers. It is shown that FedGiA is computation- and communication-efficient and converges linearly under mild conditions.
    Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models. (arXiv:2110.14993v2 [cs.LG] UPDATED)
    We study prediction of future outcomes with supervised models that use privileged information during learning. The privileged information comprises samples of time series observed between the baseline time of prediction and the future outcome; this information is only available at training time which differs from the traditional supervised learning. Our question is when using this privileged data leads to more sample-efficient learning of models that use only baseline data for predictions at test time. We give an algorithm for this setting and prove that when the time series are drawn from a non-stationary Gaussian-linear dynamical system of fixed horizon, learning with privileged information is more efficient than learning without it. On synthetic data, we test the limits of our algorithm and theory, both when our assumptions hold and when they are violated. On three diverse real-world datasets, we show that our approach is generally preferable to classical learning, particularly when data is scarce. Finally, we relate our estimator to a distillation approach both theoretically and empirically.  ( 2 min )
    GCN-Transformer for short-term passenger flow prediction on holidays in urban rail transit systems. (arXiv:2203.00007v2 [cs.LG] UPDATED)
    The short-term passenger flow prediction of the urban rail transit system is of great significance for traffic operation and management. The emerging deep learning-based models provide effective methods to improve prediction accuracy. However, most of the existing models mainly predict the passenger flow on general weekdays, while few studies focus on predicting the holiday passenger flow, which can provide more significant information for operators because congestions or accidents generally occur on holidays. To this end, we propose a deep learning-based model named GCN-Transformer, comprising a graph convolutional neural network (GCN) and a Transformer, for short-term passenger flow prediction on holidays. The GCN is applied to extract the spatial features of passenger flows and the Transformer is applied to extract the temporal features of passenger flows. Moreover, in addition to the historical passenger flow data, social media data are also incorporated into the prediction model, as they have been proven to have a potential correlation with the fluctuation of passenger flow. The GCN-Transformer is tested on two large-scale real-world datasets from Nanning, China during the New Year holiday and is compared with several conventional prediction models. Results demonstrate its better robustness and advantages among baseline methods, which provides overwhelming support for practical applications of short-term passenger flow prediction on holidays.
    Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks. (arXiv:2202.05258v2 [cs.LG] UPDATED)
    We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries.
    Fitting an immersed submanifold to data via Sussmann's orbit theorem. (arXiv:2204.01119v2 [cs.LG] UPDATED)
    This paper describes an approach for fitting an immersed submanifold of a finite-dimensional Euclidean space to random samples. The reconstruction mapping from the ambient space to the desired submanifold is implemented as a composition of an encoder that maps each point to a tuple of (positive or negative) times and a decoder given by a composition of flows along finitely many vector fields starting from a fixed initial point. The encoder supplies the times for the flows. The encoder-decoder map is obtained by empirical risk minimization, and a high-probability bound is given on the excess risk relative to the minimum expected reconstruction error over a given class of encoder-decoder maps. The proposed approach makes fundamental use of Sussmann's orbit theorem, which guarantees that the image of the reconstruction map is indeed contained in an immersed submanifold.
    Negative Evidence Matters in Interpretable Histology Image Classification. (arXiv:2201.02445v3 [eess.IV] UPDATED)
    Using only global image-class labels, weakly-supervised learning methods, such as class activation mapping, allow training CNNs to jointly classify an image, and locate regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, computer vision methods for visual interpretation of CNNs may not directly apply. In this paper, a simple yet efficient method based on a composite loss is proposed to learn information from the fully negative samples (i.e., samples without positive regions), and thereby reduce false positives/negatives. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from training data. In particular, a pre-trained CNN is equipped with a decoder that allows refining the regions of interest. The CNN is exploited to collect both positive and negative evidence at the pixel level to train the decoder. Our method called NEGEV benefits from the fully negative samples that naturally occur in the data, without any additional supervision signals beyond image-class labels. Extensive experiments show that our proposed method can substantially outperform related state-of-the-art methods on GlaS (public benchmark for colon cancer), and Camelyon16 (patch-based benchmark for breast cancer using three different backbones). Our results highlight the benefits of using both positive and negative evidence, the first obtained from a classifier, and the other naturally available in datasets.  ( 3 min )
    Causal Reasoning with Spatial-temporal Representation Learning: A Prospective Study. (arXiv:2204.12037v3 [cs.CV] UPDATED)
    Spatial-temporal representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization are becoming challenges for existing visual models. The majority of the existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, which lacks a unified guidance and analysis of why modern spatial-temporal representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for spatial-temporal representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some primary challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in spatial-temporal representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable spatial-temporal representation learning and related real-world applications more efficiently.
    OPT: Open Pre-trained Transformer Language Models. (arXiv:2205.01068v3 [cs.CL] UPDATED)
    Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
    ISDE : Independence Structure Density Estimation. (arXiv:2203.09783v2 [cs.LG] UPDATED)
    In this paper, we propose ISDE (Independence Structure Density Estimation), an algorithm designed to estimate a multivariate density under Kullback-Leibler loss and the Independence Structure (IS) model. IS tackles the curse of dimensionality by separating features into independent groups. We explain the construction of ISDE and present some experiments to show its performance on synthetic and real-world data. Performance is measured quantitatively by comparing empirical $\log$-likelihood with other density estimation methods and qualitatively by analyzing outputted partitions of variables. We also provide information about complexity and running time.
    Hybrid ISTA: Unfolding ISTA With Convergence Guarantees Using Free-Form Deep Neural Networks. (arXiv:2204.11640v2 [cs.CV] UPDATED)
    It is promising to solve linear inverse problems by unfolding iterative algorithms (e.g., iterative shrinkage thresholding algorithm (ISTA)) as deep neural networks (DNNs) with learnable parameters. However, existing ISTA-based unfolded algorithms restrict the network architectures for iterative updates with the partial weight coupling structure to guarantee convergence. In this paper, we propose hybrid ISTA to unfold ISTA with both pre-computed and learned parameters by incorporating free-form DNNs (i.e., DNNs with arbitrary feasible and reasonable network architectures), while ensuring theoretical convergence. We first develop HCISTA to improve the efficiency and flexibility of classical ISTA (with pre-computed parameters) without compromising the convergence rate in theory. Furthermore, the DNN-based hybrid algorithm is generalized to popular variants of learned ISTA, dubbed HLISTA, to enable a free architecture of learned parameters with a guarantee of linear convergence. To our best knowledge, this paper is the first to provide a convergence-provable framework that enables free-form DNNs in ISTA-based unfolded algorithms. This framework is general to endow arbitrary DNNs for solving linear inverse problems with convergence guarantees. Extensive experiments demonstrate that hybrid ISTA can reduce the reconstruction error with an improved convergence rate in the tasks of sparse recovery and compressive sensing.  ( 2 min )
    Approximate exploitability: Learning a best response in large games. (arXiv:2004.09677v4 [cs.LG] UPDATED)
    Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.  ( 2 min )
    Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness. (arXiv:2112.05907v2 [cs.CV] UPDATED)
    Face-swapping models have been drawing attention for their compelling generation quality, but their complex architectures and loss functions often require careful tuning for successful training. We propose a new face-swapping model called 'Smooth-Swap', which excludes complex handcrafted designs and allows fast and stable training. The main idea of Smooth-Swap is to build a smooth identity embedding that can provide stable gradients for identity change. Unlike the one used in previous models trained for a purely discriminative task, the proposed embedding is trained with a supervised contrastive loss promoting a smoother space. With improved smoothness, Smooth-Swap can be composed of just a generic U-Net-based generator and three basic loss functions, a far simpler design compared with previous models. Extensive experiments on face-swapping benchmarks (FFHQ, FaceForensics++) and face images in the wild show that our model is also quantitatively and qualitatively comparable or even superior to the existing methods.
    Mode Reduction for Markov Jump Systems. (arXiv:2205.02697v1 [eess.SY])
    Switched systems are capable of modeling processes with underlying dynamics that may change abruptly over time. To achieve accurate modeling in practice, one may need a large number of modes, but this may in turn increase the model complexity drastically. Existing work on reducing system complexity mainly considers state space reduction, yet reducing the number of modes is less studied. In this work, we consider Markov jump linear systems (MJSs), a special class of switched systems where the active mode switches according to a Markov chain, and several issues associated with its mode complexity. Specifically, inspired by clustering techniques from unsupervised learning, we are able to construct a reduced MJS with fewer modes that approximates well the original MJS under various metrics. Furthermore, both theoretically and empirically, we show how one can use the reduced MJS to analyze stability and design controllers with significant reduction in computational cost while achieving guaranteed accuracy.  ( 2 min )
    Is Pessimism Provably Efficient for Offline RL?. (arXiv:2012.15085v3 [cs.LG] UPDATED)
    We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.  ( 2 min )
    WPPNets and WPPFlows: The Power of Wasserstein Patch Priors for Superresolution. (arXiv:2201.08157v2 [cs.CV] UPDATED)
    Exploiting image patches instead of whole images has proved to be a powerful approach to tackle various problems in image processing. Recently, Wasserstein patch priors (WPP), which are based on the comparison of the patch distributions of the unknown image and a reference image, were successfully used as data-driven regularizers in the variational formulation of superresolution. However, for each input image, this approach requires the solution of a non-convex minimization problem which is computationally costly. In this paper, we propose to learn two kinds of neural networks in an unsupervised way based on WPP loss functions. First, we show how convolutional neural networks (CNNs) can be incorporated. Once the network, called WPPNet, is learned, it can be applied very efficiently to any input image. Second, we incorporate conditional normalizing flows to provide a tool for uncertainty quantification. Numerical examples demonstrate the very good performance of WPPNets for superresolution in various image classes even if the forward operator is known only approximately.
    Object discovery and representation networks. (arXiv:2203.08777v2 [cs.CV] UPDATED)
    The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that makes SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers this image structure by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while strongly surpassing supervised pre-training for video segmentation on DAVIS.
    Resilience of Bayesian Layer-Wise Explanations under Adversarial Attacks. (arXiv:2102.11010v3 [cs.LG] UPDATED)
    We consider the problem of the stability of saliency-based explanations of Neural Network predictions under adversarial attacks in a classification task. Saliency interpretations of deterministic Neural Networks are remarkably brittle even when the attacks fail, i.e. for attacks that do not change the classification label. We empirically show that interpretations provided by Bayesian Neural Networks are considerably more stable under adversarial perturbations of the inputs and even under direct attacks to the explanations. By leveraging recent results, we also provide a theoretical explanation of this result in terms of the geometry of the data manifold. Additionally, we discuss the stability of the interpretations of high level representations of the inputs in the internal layers of a Network. Our results demonstrate that Bayesian methods, in addition to being more robust to adversarial attacks, have the potential to provide more stable and interpretable assessments of Neural Network predictions.
    StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. (arXiv:2202.00273v2 [cs.LG] UPDATED)
    Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of $1024^2$ at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes.
    A Change Dynamic Model for the Online Detection of Gradual Change. (arXiv:2205.01054v3 [stat.ML] UPDATED)
    Changes in the statistical properties of a stochastic process are typically assumed to occur via change-points, which demark instantaneous moments of complete and total change in process behavior. In cases where these transitions occur gradually, this assumption can result in a reduced ability to properly identify and respond to process change. With this observation in mind, we introduce a novel change-dynamic model for the online detection of gradual change in a Bayesian framework, in which change-points are used within a hierarchical model to indicate moments of gradual change onset or termination. We apply this model to synthetic data and EEG readings drawn during epileptic seizure, where we find our change-dynamic model can enable faster and more accurate identification of gradual change than traditional change-point models allow.
    Detection of Large Vessel Occlusions using Deep Learning by Deforming Vessel Tree Segmentations. (arXiv:2112.01797v3 [eess.IV] UPDATED)
    Computed Tomography Angiography is a key modality providing insights into the cerebrovascular vessel tree that are crucial for the diagnosis and treatment of ischemic strokes, in particular in cases of large vessel occlusions (LVO). Thus, the clinical workflow greatly benefits from an automated detection of patients suffering from LVOs. This work uses convolutional neural networks for case-level classification trained with elastic deformation of the vessel tree segmentation masks to artificially augment training data. Using only masks as the input to our model uniquely allows us to apply such deformations much more aggressively than one could with conventional image volumes while retaining sample realism. The neural network classifies the presence of an LVO and the affected hemisphere. In a 5-fold cross validated ablation study, we demonstrate that the use of the suggested augmentation enables us to train robust models even from few data sets. Training the EfficientNetB1 architecture on 100 data sets, the proposed augmentation scheme was able to raise the ROC AUC to 0.85 from a baseline value of 0.56 using no augmentation. The best performance was achieved using a 3D-DenseNet yielding an AUC of 0.87. The augmentation had positive impact in classification of the affected hemisphere as well, where the 3D-DenseNet reached an AUC of 0.93 on both sides.
    VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers. (arXiv:2203.17247v2 [cs.CV] UPDATED)
    Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black-boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task agnostic and integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretraining vision-language multimodal transformer-based model, in the tasks of Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. Furthermore, we also present a few interesting findings about multimodal transformer behaviors that were learned through our tool.
    Local Latin Hypercube Refinement for Multi-objective Design Uncertainty Optimization. (arXiv:2108.08890v2 [stat.ML] UPDATED)
    Optimizing the reliability and the robustness of a design is important but often unaffordable due to high sample requirements. Surrogate models based on statistical and machine learning methods are used to increase the sample efficiency. However, for higher dimensional or multi-modal systems, surrogate models may also require a large amount of samples to achieve good results. We propose a sequential sampling strategy for the surrogate-based solution of multi-objective reliability-based robust design optimization problems. The proposed local Latin hypercube refinement (LoLHR) strategy is model-agnostic and can be combined with any surrogate model because there is no free lunch but possibly a budget one. The proposed method is compared to stationary sampling as well as other strategies proposed in the literature. Gaussian process and support vector regression are both used as surrogate models. Empirical evidence is presented, showing that LoLHR achieves on average better results compared to other surrogate-based strategies on the tested examples.
    LDSA: Learning Dynamic Subtask Assignment in Cooperative Multi-Agent Reinforcement Learning. (arXiv:2205.02561v1 [cs.LG])
    Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most of the MARL algorithms make all agents share the same policy or value network. However, many complex multi-agent tasks require agents with a variety of specific abilities to handle different subtasks. Sharing parameters indiscriminately may lead to similar behaviors across all agents, which will limit the exploration efficiency and be detrimental to the final performance. To balance the training complexity and the diversity of agents' behaviors, we propose a novel framework for learning dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder that constructs a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. Then, we condition the subtask policy on its representation and agents dealing with the same subtask share their experiences to train the subtask policy. We further introduce two regularizers to increase the representation difference between subtasks and avoid agents changing subtasks frequently to stabilize training, respectively. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark.  ( 2 min )
    One Size Does Not Fit All: The Case for Personalised Word Complexity Models. (arXiv:2205.02564v1 [cs.CL])
    Complex Word Identification (CWI) aims to detect words within a text that a reader may find difficult to understand. It has been shown that CWI systems can improve text simplification, readability prediction and vocabulary acquisition modelling. However, the difficulty of a word is a highly idiosyncratic notion that depends on a reader's first language, proficiency and reading experience. In this paper, we show that personal models are best when predicting word complexity for individual readers. We use a novel active learning framework that allows models to be tailored to individuals and release a dataset of complexity annotations and models as a benchmark for further research.  ( 2 min )
    Quantifying Adaptability in Pre-trained Language Models with 500 Tasks. (arXiv:2112.03204v2 [cs.CL] UPDATED)
    When a neural language model (LM) is adapted to perform a new task, what aspects of the task predict the eventual performance of the model? In NLP, systematic features of LM generalization to individual examples are well characterized, but systematic aspects of LM adaptability to new tasks are not nearly as well understood. We present a large-scale empirical study of the features and limits of LM adaptability using a new benchmark, TaskBench500, built from 500 procedurally generated sequence modeling tasks. These tasks combine core aspects of language processing, including lexical semantics, sequence processing, memorization, logical reasoning, and world knowledge. Using TaskBench500, we evaluate three facets of adaptability, finding that: (1) adaptation procedures differ dramatically in their ability to memorize small datasets; (2) within a subset of task types, adaptation procedures exhibit compositional adaptability to complex tasks; and (3) failure to match training label distributions is explained by mismatches in the intrinsic difficulty of predicting individual labels. Our experiments show that adaptability to new tasks, like generalization to new examples, can be systematically described and understood, and we conclude with a discussion of additional aspects of adaptability that could be studied using the new benchmark.
    REDS: Rule Extraction for Discovering Scenarios. (arXiv:1910.01713v2 [cs.LG] UPDATED)
    Scenario discovery is the process of finding areas of interest, known as scenarios, in data spaces resulting from simulations. For instance, one might search for conditions, i.e., inputs of the simulation model, where the system is unstable. Subgroup discovery methods are commonly used for scenario discovery. They find scenarios in the form of hyperboxes, which are easy to comprehend. Given a computational budget, results tend to get worse as the number of inputs of the simulation model and the cost of simulations increase. We propose a new procedure for scenario discovery from few simulations, dubbed REDS. A key ingredient is using an intermediate machine learning model to label data for subsequent use by conventional subgroup discovery methods. We provide statistical arguments why this is an improvement. In our experiments, REDS reduces the number of simulations required by 50--75% on average, depending on the quality measure. It is also useful as a semi-supervised subgroup discovery method and for discovering better scenarios from third-party data, when a simulation model is not available.
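    A minimal sketch of the core trick, under stated assumptions (a toy `simulate` function, a random-forest surrogate, and a single hand-picked hyperbox standing in for a real subgroup-discovery routine):

```python
# Hedged sketch of REDS's labeling step: train a cheap surrogate on few
# expensive simulations, then label many pseudo-points for subgroup discovery.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def simulate(x):                                 # stand-in for an expensive simulator
    return float(x[0] ** 2 + x[1] > 1.0)         # "unstable" region of interest

X_sim = rng.uniform(0, 1, (50, 2))               # only 50 real simulations
y_sim = np.array([simulate(x) for x in X_sim])

surrogate = RandomForestClassifier(n_estimators=200, random_state=0)
surrogate.fit(X_sim, y_sim)

X_pseudo = rng.uniform(0, 1, (5000, 2))          # cheap unlabeled points
y_pseudo = surrogate.predict(X_pseudo)           # surrogate-provided labels

# a conventional subgroup-discovery method would now run on (X_pseudo, y_pseudo);
# as a toy stand-in, report coverage and density for one candidate hyperbox
box = (X_pseudo[:, 0] > 0.5) & (X_pseudo[:, 1] > 0.5)
print("coverage:", box.mean(), "density:", y_pseudo[box].mean())
```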
    Dual Octree Graph Networks for Learning Adaptive Volumetric Shape Representations. (arXiv:2205.02825v1 [cs.CV])
    We present an adaptive deep representation of volumetric fields of 3D shapes and an efficient approach to learn this deep representation for high-quality 3D shape reconstruction and auto-encoding. Our method encodes the volumetric field of a 3D shape with an adaptive feature volume organized by an octree and applies a compact multilayer perceptron network for mapping the features to the field value at each 3D position. An encoder-decoder network is designed to learn the adaptive feature volume based on the graph convolutions over the dual graph of octree nodes. The core of our network is a new graph convolution operator defined over a regular grid of features fused from irregular neighboring octree nodes at different levels, which not only reduces the computational and memory cost of the convolutions over irregular neighboring octree nodes, but also improves the performance of feature learning. Our method effectively encodes shape details, enables fast 3D shape reconstruction, and exhibits good generality for modeling 3D shapes out of training categories. We evaluate our method on a set of reconstruction tasks of 3D shapes and scenes and validate its superiority over other existing approaches. Our code, data, and trained models are available at https://wang-ps.github.io/dualocnn.
    FedSPLIT: One-Shot Federated Recommendation System Based on Non-negative Joint Matrix Factorization and Knowledge Distillation. (arXiv:2205.02359v1 [cs.LG])
    Non-negative matrix factorization (NMF) with missing-value completion is a well-known effective Collaborative Filtering (CF) method used to provide personalized user recommendations. However, traditional CF relies on the privacy-invasive collection of users' explicit and implicit feedback to build a central recommender model. One-shot federated learning has recently emerged as a method to mitigate the privacy problem while addressing the traditional communication bottleneck of federated learning. In this paper, we present the first unsupervised one-shot federated CF implementation, named FedSPLIT, based on NMF joint factorization. In our solution, the clients first apply local CF in parallel to build distinct client-specific recommenders. Then, the privacy-preserving local item patterns and biases from each client are shared with the processor to perform joint factorization in order to extract the global item patterns. The extracted patterns are then distributed to each client to build the local models via knowledge distillation. In our experiments, we demonstrate the feasibility of our approach with standard recommendation datasets. FedSPLIT obtains results similar to the state of the art (and can even outperform it in certain situations) with a substantial decrease in the number of communications.
    Visual Domain Adaptation for Monocular Depth Estimation on Resource-Constrained Hardware. (arXiv:2108.02671v2 [cs.CV] UPDATED)
    Real-world perception systems often build on hardware with limited resources to adhere to cost and power limitations of their carrying system. Deploying deep neural networks on resource-constrained hardware became possible with model compression techniques, as well as efficient and hardware-aware architecture design. However, model adaptation is additionally required due to the diverse operation environments. In this work, we address the problem of training deep neural networks on resource-constrained hardware in the context of visual domain adaptation. We select the task of monocular depth estimation, where our goal is to transform a pre-trained model to the target's domain data. While the source domain includes labels, we assume an unlabelled target domain, as happens in real-world applications. Then, we present an adversarial learning approach that is adapted for training on the device with limited resources. Since visual domain adaptation, i.e. neural network training, has not been previously explored for resource-constrained hardware, we present the first feasibility study for image-based depth estimation. Our experiments show that visual domain adaptation is relevant only for efficient network architectures and training sets on the order of a few hundred samples. Models and code are publicly available.
    Based-CE white-box adversarial attack will not work using super-fitting. (arXiv:2205.02741v1 [cs.LG])
    Deep Neural Networks (DNN) are widely used in various fields due to their powerful performance, but recent studies have shown that deep learning models are vulnerable to adversarial attacks: by adding a slight perturbation to the input, the model will produce wrong results. This is especially dangerous for systems with high security requirements, so this paper proposes a new defense method that exploits the model's super-fitting status. The model's adversarial robustness (i.e., the accuracy under adversarial attack) is greatly improved in this status. This paper mathematically proves the effectiveness of super-fitting and proposes a method to make the model reach this status quickly: minimizing unrelated category scores (MUCS). Theoretically, super-fitting can resist any existing (and even future) CE-based white-box adversarial attack. In addition, this paper uses a variety of powerful attack algorithms to evaluate the adversarial robustness of super-fitting and nearly 50 other defense models from recent conferences. The experimental results show that the super-fitting method in this paper enables the trained model to obtain the highest adversarial robustness.
    Uncertainty Minimization for Personalized Federated Semi-Supervised Learning. (arXiv:2205.02438v1 [cs.LG])
    Since federated learning (FL) has been introduced as a decentralized learning technique with privacy preservation, statistical heterogeneity of distributed data remains the main obstacle to achieving robust performance and stable convergence in FL applications. Model personalization methods have been studied to overcome this problem. However, existing approaches are mainly under the prerequisite of fully labeled data, which is unrealistic in practice due to the requirement of expertise. The primary issue caused by the partially labeled condition is that clients with deficient labeled data can suffer from unfair performance gains because they lack adequate insight into the local distribution to customize the global model. To tackle this problem, 1) we propose a novel personalized semi-supervised learning paradigm which allows partially labeled or unlabeled clients to seek labeling assistance from data-related clients (helper agents), thus enhancing their perception of local data; 2) based on this paradigm, we design an uncertainty-based data-relation metric to ensure that selected helpers can provide trustworthy pseudo labels instead of misleading the local training; 3) to mitigate the network overload introduced by helper searching, we further develop a helper selection protocol to achieve efficient communication with negligible performance sacrifice. Experiments show that our proposed method obtains superior performance and more stable convergence than other related works with partially labeled data, especially in highly heterogeneous settings.
    DeepBayes -- an estimator for parameter estimation in stochastic nonlinear dynamical models. (arXiv:2205.02264v1 [stat.ML])
    Stochastic nonlinear dynamical systems are ubiquitous in modern, real-world applications. Yet, estimating the unknown parameters of stochastic, nonlinear dynamical models remains a challenging problem. The majority of existing methods employ maximum likelihood or Bayesian estimation. However, these methods suffer from some limitations, most notably the substantial computational time for inference coupled with limited flexibility in application. In this work, we propose DeepBayes estimators that leverage the power of deep recurrent neural networks in learning an estimator. The method consists of first training a recurrent neural network to minimize the mean-squared estimation error over a set of synthetically generated data, using models drawn from the model set of interest. The a priori trained estimator can then be used directly for inference by evaluating the network on the estimation data. The deep recurrent neural network architectures can be trained offline and ensure significant time savings during inference. We experiment with two popular recurrent neural networks -- the long short-term memory (LSTM) network and the gated recurrent unit (GRU). We demonstrate the applicability of our proposed method on different example models and perform detailed comparisons with state-of-the-art approaches. We also provide a study on a real-world nonlinear benchmark problem. The experimental evaluations show that the proposed approach is asymptotically as good as the Bayes estimator.
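    A hedged sketch of the recipe on a toy AR(1)-style model: a GRU is trained offline on synthetic (trajectory, parameter) pairs and then acts as a direct estimator. The model, network sizes, and training constants are illustrative assumptions.

```python
# Hedged sketch of the DeepBayes idea on a toy stochastic model.
import torch
import torch.nn as nn

def simulate(theta, T=50):                      # x_{t+1} = theta * x_t + noise
    x = torch.zeros(theta.shape[0], T)
    for t in range(T - 1):
        x[:, t + 1] = theta.squeeze(-1) * x[:, t] + 0.1 * torch.randn(theta.shape[0])
    return x.unsqueeze(-1)                      # (batch, T, 1)

class Estimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(1, 32, batch_first=True)
        self.head = nn.Linear(32, 1)
    def forward(self, x):
        _, h = self.gru(x)
        return self.head(h[-1])

est = Estimator()
opt = torch.optim.Adam(est.parameters(), lr=1e-3)
for step in range(200):                         # offline training on synthetic data
    theta = torch.rand(64, 1) * 0.9             # parameters drawn from the model set
    loss = ((est(simulate(theta)) - theta) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# at inference, est(observed_trajectory) returns an estimate in a single forward pass
```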
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v1 [stat.ML])
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning. (arXiv:2205.02450v1 [cs.LG])
    Dynamic mechanism design has garnered significant attention from both computer scientists and economists in recent years. By allowing agents to interact with the seller over multiple rounds, where agents' reward functions may change with time and are state-dependent, the framework is able to model a rich class of real-world problems. In these works, the interaction between agents and sellers is often assumed to follow a Markov Decision Process (MDP). We focus on the setting where the reward and transition functions of such an MDP are not known a priori, and we attempt to recover the optimal mechanism using an a priori collected data set. In the setting where function approximation is employed to handle large state spaces, with only mild assumptions on the expressiveness of the function class, we are able to design a dynamic mechanism using offline reinforcement learning algorithms. Moreover, the learned mechanisms approximately satisfy three key desiderata: efficiency, individual rationality, and truthfulness. Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set. To the best of our knowledge, our work provides the first offline RL algorithm for dynamic mechanism design without assuming uniform coverage.
    Spiking Graph Convolutional Networks. (arXiv:2205.02767v1 [cs.LG])
    Graph Convolutional Networks (GCNs) achieve an impressive performance due to the remarkable representation ability in learning the graph information. However, GCNs, when implemented on a deep network, require expensive computation power, making them difficult to be deployed on battery-powered devices. In contrast, Spiking Neural Networks (SNNs), which perform a bio-fidelity inference process, offer an energy-efficient neural architecture. In this work, we propose SpikingGCN, an end-to-end framework that aims to integrate the embedding of GCNs with the bio-fidelity characteristics of SNNs. The original graph data are encoded into spike trains based on the incorporation of graph convolution. We further model biological information processing by utilizing a fully connected layer combined with neuron nodes. In a wide range of scenarios (e.g. citation networks, image graph classification, and recommender systems), our experimental results show that the proposed method achieves competitive performance against state-of-the-art approaches. Furthermore, we show that SpikingGCN on a neuromorphic chip can bring a clear advantage of energy efficiency into graph data analysis, which demonstrates its great potential to construct environment-friendly machine learning models.
    Cognitive Radio Resource Scheduling using Multi agent QLearning for LTE. (arXiv:2205.02765v1 [cs.NI])
    In this paper, we propose, implement, and test two novel downlink LTE scheduling algorithms. The implementation and testing of these algorithms were done in Matlab, and they are based on the use of Reinforcement Learning, more specifically the Q-learning technique, for scheduling two types of users. The first algorithm is called the Collaborative scheduling algorithm, and the second is called the Competitive scheduling algorithm. The first type of scheduled user is the Primary User; these are the licensed subscribers that pay for their service. The second type of scheduled user is the Secondary User; these could be unlicensed subscribers that don't pay for their service, device-to-device communications, or sensors. Each user, whether primary or secondary, is treated as an agent. In the Collaborative scheduling algorithm, the primary user agents collaborate in order to make a joint scheduling decision about allocating the resource blocks to each one of them, and then the secondary user agents compete among themselves to use the remaining resource blocks. In the Competitive scheduling algorithm, the primary user agents compete among themselves over the available resources, and then the secondary user agents compete among themselves over the remaining resources. Experimental results show that both scheduling algorithms converged to almost ninety percent utilization of the spectrum and provided fair shares of the spectrum among users.
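    A toy sketch of the competitive flavor, assuming a stateless (bandit-style) Q-table and a simple uncontested-block reward; the constants and reward model are illustrative, not the paper's Matlab implementation.

```python
# Hedged toy of competitive spectrum scheduling with stateless Q-learning:
# two agents learn to claim distinct resource blocks.
import numpy as np

rng = np.random.default_rng(0)
n_blocks, n_agents, eps, alpha = 4, 2, 0.1, 0.5
Q = np.zeros((n_agents, n_blocks))              # bandit-style Q-table per agent

for episode in range(5000):
    picks = [int(b) if rng.random() > eps else int(rng.integers(n_blocks))
             for b in Q.argmax(axis=1)]         # epsilon-greedy block choice
    for a, b in enumerate(picks):
        r = 1.0 if picks.count(b) == 1 else 0.0 # reward only if block is uncontested
        Q[a, b] += alpha * (r - Q[a, b])
print(Q.argmax(axis=1))                         # agents settle on distinct blocks
```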
    Textless Speech-to-Speech Translation on Real Data. (arXiv:2112.08352v2 [cs.CL] UPDATED)
    We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs. Audio samples are available at https://facebookresearch.github.io/speech_translation/textless_s2st_real_data/index.html.
    Communication-Efficient Adaptive Federated Learning. (arXiv:2205.02719v1 [cs.LG])
    Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to the repetitive server-client synchronization and the lack of adaptivity by SGD-based model updates. Although various methods have been proposed for reducing the communication cost by gradient compression or quantization, and federated versions of adaptive optimizers such as FedAdam have been proposed to add more adaptivity, the current federated learning framework still cannot solve all the aforementioned challenges at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.
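    The sketch below combines the two ingredients named above on a toy least-squares problem: top-k gradient compression with error feedback on the clients, and an AMSGrad-style adaptive update on the server. All constants, the compression rule, and the problem are illustrative assumptions, not the authors' implementation.

```python
# Hedged numpy sketch of compressed, adaptive federated optimization.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, k, lr, b1, b2 = 20, 4, 5, 0.1, 0.9, 0.99
w_true = np.arange(d) / d
X = [rng.normal(size=(30, d)) for _ in range(n_clients)]
y = [Xi @ w_true + 0.01 * rng.normal(size=30) for Xi in X]
w, err = np.zeros(d), [np.zeros(d) for _ in range(n_clients)]
m, v, vhat = np.zeros(d), np.zeros(d), np.zeros(d)

def topk(g, k):                                 # keep k largest-magnitude entries
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

for rnd in range(500):
    agg = np.zeros(d)
    for i in range(n_clients):
        g = X[i].T @ (X[i] @ w - y[i]) / 30 + err[i]
        c = topk(g, k)                          # compressed client-to-server message
        err[i] = g - c                          # error feedback: remember the residual
        agg += c / n_clients
    m = b1 * m + (1 - b1) * agg                 # AMSGrad-style server step
    v = b2 * v + (1 - b2) * agg ** 2
    vhat = np.maximum(vhat, v)
    w -= lr * m / (np.sqrt(vhat) + 1e-8)
print(np.round(w[:5], 2))                       # moves toward w_true = arange(d)/d
```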
    CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting. (arXiv:2202.01575v3 [cs.LG] UPDATED)
    Deep learning has been actively studied for time series forecasting, and the mainstream paradigm is based on the end-to-end training of neural network architectures, ranging from classical LSTM/RNNs to more recent TCNs and Transformers. Motivated by the recent success of representation learning in computer vision and natural language processing, we argue that a more promising paradigm for time series forecasting is to first learn disentangled feature representations, followed by a simple regression fine-tuning step -- we justify such a paradigm from a causal perspective. Following this principle, we propose a new time series representation learning framework for time series forecasting named CoST, which applies contrastive learning methods to learn disentangled seasonal-trend representations. CoST comprises both time domain and frequency domain contrastive losses to learn discriminative trend and seasonal representations, respectively. Extensive experiments on real-world datasets show that CoST consistently outperforms the state-of-the-art methods by a considerable margin, achieving a 21.3% improvement in MSE on multivariate benchmarks. It is also robust to various choices of backbone encoders, as well as downstream regressors. Code is available at https://github.com/salesforce/CoST.
    Polynomial-Time Algorithms for Counting and Sampling Markov Equivalent DAGs with Applications. (arXiv:2205.02654v1 [cs.LG])
    Counting and sampling directed acyclic graphs from a Markov equivalence class are fundamental tasks in graphical causal analysis. In this paper we show that these tasks can be performed in polynomial time, solving a long-standing open problem in this area. Our algorithms are effective and easily implementable. As we show in experiments, these breakthroughs make thought-to-be-infeasible strategies in active learning of causal structures and causal effect identification with regard to a Markov equivalence class practically applicable.
    LAWS: Look Around and Warm-Start Natural Gradient Descent for Quantum Neural Networks. (arXiv:2205.02666v1 [quant-ph])
    Variational quantum algorithms (VQAs) have recently received significant attention from the research community due to their promising performance on Noisy Intermediate-Scale Quantum (NISQ) computers. However, VQAs run on parameterized quantum circuits (PQC) with randomly initialized parameters are characterized by barren plateaus (BP), where the gradient vanishes exponentially in the number of qubits. In this paper, we first review quantum natural gradient (QNG), one of the most popular algorithms used in VQAs, from the classical first-order optimization point of view. Then, we propose a Look Around Warm-Start QNG (LAWS) algorithm to mitigate the widespread BP issue. LAWS is a combinatorial optimization strategy that takes advantage of model parameter initialization and the fast convergence of QNG. LAWS repeatedly reinitializes the parameter search space for the next parameter update. The reinitialized parameter search space is carefully chosen by sampling the gradient close to the current optimum. Moreover, we present a unified framework (WS-SGD) for integrating parameter initialization techniques into the optimizer. We provide a convergence proof of the proposed framework for both convex and non-convex objective functions based on the Polyak-Lojasiewicz (PL) condition. Our experimental results show that the proposed algorithm can mitigate BP and has better generalization ability in quantum classification problems.
    Population Predictive Checks. (arXiv:1908.00882v4 [stat.ME] UPDATED)
    Bayesian modeling has become a staple for researchers to articulate assumptions and develop methods tailored for specific data applications. Thanks to recent developments in approximate posterior inference, researchers can easily build, use, and revise complicated Bayesian models for large and rich data. These new abilities, however, bring into focus the problem of model criticism. Researchers need tools to diagnose the fitness of their models, to understand where they fall short, and to guide their revision. In this paper we develop a new method for Bayesian model criticism, the population predictive check (POP-PC). POP-PCs are built on posterior predictive checks (PPCs), a seminal method that checks a model by assessing the posterior predictive distribution on the observed data. However, PPCs use the data twice -- both to calculate the posterior predictive and to evaluate it -- which can lead to overconfident assessments of the quality of a model. POP-PCs, in contrast, compare the posterior predictive distribution to a draw from the population distribution, which in practice is a held-out dataset. We prove this strategy, which blends Bayesian modeling with frequentist assessment, is calibrated, unlike the PPC. Moreover, we demonstrate that calibrating PPC p-values post hoc does not resolve the "double use of the data" problem. Finally, we study POP-PCs on classical regression and a hierarchical model of text data.
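    A minimal numpy sketch of the contrast, under toy assumptions (a conjugate normal-mean posterior with known unit variance and a mean test statistic): the PPC scores replicates against the same data used to fit the posterior, while the POP-PC scores them against held-out data.

```python
# Hedged sketch contrasting a PPC with a population predictive check.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, 100)
train, heldout = data[:70], data[70:]

def predictive(fit_data, n_rep, size):
    # toy conjugate posterior for the mean (known unit variance, flat prior)
    mu, sd = fit_data.mean(), 1.0 / np.sqrt(len(fit_data))
    mus = rng.normal(mu, sd, n_rep)
    return rng.normal(mus[:, None], 1.0, (n_rep, size))

stat = lambda x: x.mean(axis=-1)

# PPC: posterior fit on `data`, replicates compared against the same `data`
ppc_p = np.mean(stat(predictive(data, 2000, len(data))) >= stat(data))
# POP-PC: posterior fit on `train`, replicates compared against held-out data
pop_p = np.mean(stat(predictive(train, 2000, len(heldout))) >= stat(heldout))
print(round(ppc_p, 2), round(pop_p, 2))
```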
    Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning. (arXiv:1708.02190v3 [cs.AI] UPDATED)
    Intrinsically motivated spontaneous exploration is a key enabler of autonomous developmental learning in human children. It enables the discovery of skill repertoires through autotelic learning, i.e. the self-generation, self-selection, self-ordering and self-experimentation of learning goals. We present an algorithmic approach called Intrinsically Motivated Goal Exploration Processes (IMGEP) to enable similar properties of autonomous learning in machines. The IMGEP architecture relies on several principles: 1) self-generation of goals, generalized as parameterized fitness functions; 2) selection of goals based on intrinsic rewards; 3) exploration with incremental goal-parameterized policy search and exploitation with a batch learning algorithm; 4) systematic reuse of information acquired when targeting a goal for improving towards other goals. We present a particularly efficient form of IMGEP, called AMB, that uses a population-based policy and an object-centered spatio-temporal modularity. We provide several implementations of this architecture and demonstrate their ability to automatically generate a learning curriculum within several experimental setups. One of these experiments includes a real humanoid robot exploring multiple spaces of goals with several hundred continuous dimensions and with distractors. While no particular target goal is provided to these autotelic agents, this curriculum allows the discovery of diverse skills that act as stepping stones for learning more complex skills, e.g. nested tool use.
    Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization. (arXiv:2201.07912v2 [cs.LG] UPDATED)
    Federated learning (FL) is a useful tool in distributed machine learning that utilizes users' local datasets in a privacy-preserving manner. When deploying FL in a constrained wireless environment, however, training models in a time-efficient manner can be challenging due to intermittent connectivity of devices, heterogeneous connection quality, and non-i.i.d. data. In this paper, we provide a novel convergence analysis of non-convex loss functions using FL on both i.i.d. and non-i.i.d. datasets with arbitrary device selection probabilities for each round. Then, using the derived convergence bound, we use stochastic optimization to develop a new client selection and power allocation algorithm that minimizes a function of the convergence bound and the average communication time under a transmit power constraint. We find an analytical solution to the minimization problem. One key feature of the algorithm is that knowledge of the channel statistics is not required; only the instantaneous channel state information needs to be known. Using the FEMNIST and CIFAR-10 datasets, we show through simulations that the communication time can be significantly decreased using our algorithm compared to uniformly random participation.
    A Graph Attention Learning Approach to Antenna Tilt Optimization. (arXiv:2112.14843v2 [cs.LG] UPDATED)
    6G will move mobile networks towards increasing levels of complexity. To deal with this complexity, optimization of network parameters is key to ensure high performance and timely adaptivity to dynamic network environments. The optimization of the antenna tilt provides a practical and cost-efficient method to improve coverage and capacity in the network. Previous methods based on Reinforcement Learning (RL) have shown great promise for tilt optimization by learning adaptive policies that outperform traditional tilt optimization methods. However, most existing RL methods are based on single-cell feature representations, which fail to fully characterize the agent state, resulting in suboptimal performance. Moreover, most such methods lack scalability, due to state-action space explosion, and generalization ability. In this paper, we propose a Graph Attention Q-learning (GAQ) algorithm for tilt optimization. GAQ relies on a graph attention mechanism to select relevant neighbor information, improve the agent state representation, and update the tilt control policy based on a history of observations using a Deep Q-Network (DQN). We show that GAQ efficiently captures important network information and outperforms standard DQN with local information by a large margin. In addition, we demonstrate its ability to generalize to network deployments of different sizes and densities.
    Training Recurrent Neural Networks by Sequential Least Squares and the Alternating Direction Method of Multipliers. (arXiv:2112.15348v2 [cs.LG] UPDATED)
    This paper proposes a novel algorithm for training recurrent neural network models of nonlinear dynamical systems from an input/output training dataset. Arbitrary convex and twice-differentiable loss functions and regularization terms are handled by sequential least squares and either a line-search (LS) or a trust-region method à la Levenberg-Marquardt (LM) for ensuring convergence. In addition, to handle non-smooth regularization terms such as $\ell_1$, $\ell_0$, and group-Lasso regularizers, as well as to impose possibly non-convex constraints such as integer and mixed-integer constraints, we combine sequential least squares with the alternating direction method of multipliers (ADMM). We call the resulting algorithm NAILS (nonconvex ADMM iterations and least squares) when line search (LS) is used, or NAILM if a trust-region method (LM) is employed instead. The training method, which is also applicable to feedforward neural networks as a special case, is tested on two nonlinear system identification problems.
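    To make the ADMM building block concrete, here is a standard lasso ADMM with a soft-thresholding prox step, the classic way the $\ell_1$ term is handled; this is a generic textbook sketch on a toy problem, not the full NAILS/NAILM loop.

```python
# Hedged sketch of ADMM for l1-regularized least squares (lasso).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(60, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.01 * rng.normal(size=60)
lam, rho = 0.1, 1.0

x, z, u = np.zeros(20), np.zeros(20), np.zeros(20)
L = np.linalg.cholesky(A.T @ A + rho * np.eye(20))   # factor once, reuse every iter
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for it in range(200):
    rhs = A.T @ b + rho * (z - u)
    x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # least-squares step
    z = soft(x + u, lam / rho)                         # prox of the l1 regularizer
    u = u + x - z                                      # dual (multiplier) update
print(np.round(z[:5], 2))                              # recovers the sparse solution
```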
    The Role of Explainability in Assuring Safety of Machine Learning in Healthcare. (arXiv:2109.00520v2 [cs.LG] UPDATED)
    Established approaches to assuring safety-critical systems and software are difficult to apply to systems employing ML where there is no clear, pre-defined specification against which to assess validity. This problem is exacerbated by the "opaque" nature of ML where the learnt model is not amenable to human scrutiny. Explainable AI (XAI) methods have been proposed to tackle this issue by producing human-interpretable representations of ML models which can help users to gain confidence and build trust in the ML system. However, little work explicitly investigates the role of explainability for safety assurance in the context of ML development. This paper identifies ways in which XAI methods can contribute to safety assurance of ML-based systems. It then uses a concrete ML-based clinical decision support system, concerning weaning of patients from mechanical ventilation, to demonstrate how XAI methods can be employed to produce evidence to support safety assurance. The results are also represented in a safety argument to show where, and in what way, XAI methods can contribute to a safety case. Overall, we conclude that XAI methods have a valuable role in safety assurance of ML-based systems in healthcare but that they are not sufficient in themselves to assure safety.
    View-labels Are Indispensable: A Multifacet Complementarity Study of Multi-view Clustering. (arXiv:2205.02507v1 [cs.LG])
    Consistency and complementarity are two key ingredients for boosting multi-view clustering (MVC). Recently, with the introduction of popular contrastive learning, the consistency learning of views has been further enhanced in MVC, leading to promising performance. However, by contrast, the complementarity has not received sufficient attention except in the feature facet, where the Hilbert Schmidt Independence Criterion (HSIC) term or the independent encoder-decoder network is usually adopted to capture view-specific information. This motivates us to reconsider the complementarity learning of views comprehensively from multiple facets, including the feature, view-label, and contrast facets, while maintaining view consistency. We empirically find that all the facets contribute to the complementarity learning, especially the view-label facet, which is usually neglected by existing methods. Based on this, we develop a novel Multifacet Complementarity learning framework for Multi-View Clustering (MCMVC), which fuses multifacet complementarity information, especially explicitly embedding the view-label information. To the best of our knowledge, this is the first work to use view-labels explicitly to guide the complementarity learning of views. Compared with the SOTA baselines, MCMVC achieves remarkable improvements, e.g., by average margins over $5.00\%$ and $7.00\%$ respectively in complete and incomplete MVC settings on Caltech101-20 in terms of three evaluation metrics.
    FAITH: Few-Shot Graph Classification with Hierarchical Task Graphs. (arXiv:2205.02435v1 [cs.LG])
    Few-shot graph classification aims at predicting classes for graphs, given limited labeled graphs for each class. To tackle the bottleneck of label scarcity, recent works propose to incorporate few-shot learning frameworks for fast adaptations to graph classes with limited labeled graphs. Specifically, these works propose to accumulate meta-knowledge across diverse meta-training tasks, and then generalize such meta-knowledge to the target task with a disjoint label set. However, existing methods generally ignore task correlations among meta-training tasks while treating them independently. Nevertheless, such task correlations can advance the model generalization to the target task for better classification performance. On the other hand, it remains non-trivial to utilize task correlations due to the complex components in a large number of meta-training tasks. To deal with this, we propose a novel few-shot learning framework FAITH that captures task correlations via constructing a hierarchical task graph at different granularities. Then we further design a loss-based sampling strategy to select tasks with more correlated classes. Moreover, a task-specific classifier is proposed to utilize the learned task correlations for few-shot classification. Extensive experiments on four prevalent few-shot graph classification datasets demonstrate the superiority of FAITH over other state-of-the-art baselines.
    Second-Order Sensitivity Analysis for Bilevel Optimization. (arXiv:2205.02329v1 [math.OC])
    In this work we derive a second-order approach to bilevel optimization, a type of mathematical programming in which the solution to a parameterized optimization problem (the "lower" problem) is itself to be optimized (in the "upper" problem) as a function of the parameters. Many existing approaches to bilevel optimization employ first-order sensitivity analysis, based on the implicit function theorem (IFT), for the lower problem to derive a gradient of the lower problem solution with respect to its parameters; this IFT gradient is then used in a first-order optimization method for the upper problem. This paper extends this sensitivity analysis to provide second-order derivative information of the lower problem (which we call the IFT Hessian), enabling the usage of faster-converging second-order optimization methods at the upper level. Our analysis shows that (i) much of the computation already used to produce the IFT gradient can be reused for the IFT Hessian, (ii) error bounds derived for the IFT gradient readily apply to the IFT Hessian, and (iii) computing IFT Hessians can significantly reduce overall computation by extracting more information from each lower-level solve. We corroborate our findings and demonstrate the broad range of applications of our method by applying it to problem instances of least squares hyperparameter auto-tuning, multi-class SVM auto-tuning, and inverse optimal control.
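    A hedged numpy sketch of the IFT machinery on a toy quadratic lower problem, where the sensitivities have a closed form that can be checked against finite differences; the problem and variable names are illustrative, not the paper's code.

```python
# Hedged sketch of IFT-based first- and second-order sensitivities.
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = np.diag(rng.uniform(1.0, 2.0, n))       # lower-level Hessian, fixed and PD

def lower_solve(p):                         # x*(p) = argmin_x 0.5 x'Ax - p'x
    return np.linalg.solve(A, p)

def upper(x):                               # upper objective F(x*(p))
    return 0.5 * np.sum((x - 1.0) ** 2)

# Stationarity A x* - p = 0 gives, via the implicit function theorem,
#   dx*/dp = A^{-1}; the same factorization is reused for the Hessian.
J = np.linalg.inv(A)                        # IFT Jacobian of the solution map
p = rng.normal(size=n)
x = lower_solve(p)
ift_grad = J.T @ (x - 1.0)                  # dF/dp via the chain rule
ift_hess = J.T @ np.eye(n) @ J              # d2F/dp2 (d2F/dx2 = I here)

eps = 1e-6                                  # finite-difference check of the gradient
fd = np.array([(upper(lower_solve(p + eps * e)) - upper(lower_solve(p - eps * e)))
               / (2 * eps) for e in np.eye(n)])
print(np.allclose(ift_grad, fd, atol=1e-5), np.round(np.diag(ift_hess), 3))
```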
    Predicting Basin Stability of Power Grids using Graph Neural Networks. (arXiv:2108.08230v3 [physics.soc-ph] UPDATED)
    The prediction of dynamical stability of power grids becomes more important and challenging with increasing shares of renewable energy sources due to their decentralized structure, reduced inertia and volatility. We investigate the feasibility of applying graph neural networks (GNN) to predict dynamic stability of synchronisation in complex power grids using the single-node basin stability (SNBS) as a measure. To do so, we generate two synthetic datasets for grids with 20 and 100 nodes respectively and estimate SNBS using Monte-Carlo sampling. Those datasets are used to train and evaluate the performance of eight different GNN-models. All models use the full graph without simplifications as input and predict SNBS in a nodal-regression-setup. We show that SNBS can be predicted in general and the performance significantly changes using different GNN-models. Furthermore, we observe interesting transfer capabilities of our approach: GNN-models trained on smaller grids can directly be applied on larger grids without the need of retraining.
    Non-Euclidean Differentially Private Stochastic Convex Optimization: Optimal Rates in Linear Time. (arXiv:2103.01278v2 [cs.LG] UPDATED)
    Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of $n$ i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) w.r.t. the $\ell_2$ norm over a constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting. In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups under a standard smoothness assumption on the loss. For $1< p\leq 2$, we give a new, linear-time DP-SCO algorithm with optimal excess risk. Previously known constructions with optimal excess risk for $1< p <2$ run in super-linear time in $n$. For $p=1$, we give an algorithm with nearly optimal excess risk. Our result for the $\ell_1$-setup also extends to general polyhedral norms and feasible sets. Moreover, we show that the excess risk bounds resulting from our algorithms for $1\leq p \leq 2$ are attained with high probability. For $2 < p \leq \infty$, we show that existing linear-time constructions for the Euclidean setup attain a nearly optimal excess risk in the low-dimensional regime. As a consequence, we show that such constructions attain a nearly optimal excess risk for $p=\infty$. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.
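    For reference, here is the noisy-SGD primitive that attains the optimal rate in the $\ell_2$ setting: per-sample gradient clipping plus Gaussian noise. The clip norm, noise multiplier, and step size below are illustrative, with no formal $(\epsilon, \delta)$ accounting.

```python
# Hedged sketch of clipped, noisy SGD on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch, clip, sigma, lr = 500, 10, 32, 1.0, 0.8, 0.1
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

for step in range(400):
    i = rng.integers(n, size=batch)
    per_sample = X[i] * (X[i] @ w - y[i])[:, None]            # per-sample gradients
    norms = np.linalg.norm(per_sample, axis=1, keepdims=True)
    per_sample *= np.minimum(1.0, clip / (norms + 1e-12))     # clip each to norm <= clip
    g = per_sample.mean(axis=0) + (sigma * clip / batch) * rng.normal(size=d)
    w -= lr * g
print(np.round(w[:3], 1))                                     # moves toward all-ones
```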
    Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention. (arXiv:2112.03254v3 [cs.CL] UPDATED)
    Most of today's AI systems focus on using self-attention mechanisms and transformer architectures on large amounts of diverse data to achieve impressive performance gains. In this paper, we propose to augment the transformer architecture with an external attention mechanism to bring external knowledge and context to bear. By integrating external information into the prediction process, we hope to reduce the need for ever-larger models and increase the democratization of AI systems. We find that the proposed external attention mechanism can significantly improve the performance of existing AI systems, allowing practitioners to easily customize foundation AI models to many diverse downstream applications. In particular, we focus on the task of Commonsense Reasoning, demonstrating that the proposed external attention mechanism can augment existing transformer models and significantly improve the model's reasoning capabilities. The proposed system, Knowledgeable External Attention for commonsense Reasoning (KEAR), reaches human parity on the open CommonsenseQA research benchmark with an accuracy of 89.4% in comparison to the human accuracy of 88.9%.
    Maximum Entropy RL (Provably) Solves Some Robust RL Problems. (arXiv:2103.06257v2 [cs.LG] UPDATED)
    Many potential applications of reinforcement learning (RL) require guarantees that the agent will perform well in the face of disturbances to the dynamics or reward function. In this paper, we prove theoretically that maximum entropy (MaxEnt) RL maximizes a lower bound on a robust RL objective, and thus can be used to learn policies that are robust to some disturbances in the dynamics and the reward function. While this capability of MaxEnt RL has been observed empirically in prior work, to the best of our knowledge our work provides the first rigorous proof and theoretical characterization of the MaxEnt RL robust set. While a number of prior robust RL algorithms have been designed to handle similar disturbances to the reward function or dynamics, these methods typically require additional moving parts and hyperparameters on top of a base RL algorithm. In contrast, our results suggest that MaxEnt RL by itself is robust to certain disturbances, without requiring any additional modifications. While this does not imply that MaxEnt RL is the best available robust RL method, MaxEnt RL is a simple robust RL method with appealing formal guarantees.
    Dynamic Bayesian Network Auxiliary ABC-SMC for Hybrid Model Bayesian Inference to Accelerate Biomanufacturing Process Mechanism Learning and Robust Control. (arXiv:2205.02410v1 [stat.ML])
    Driven by the critical needs of biomanufacturing 4.0, we present a probabilistic knowledge graph hybrid model characterizing complex spatial-temporal causal interdependencies of underlying bioprocessing mechanisms. It can faithfully capture the important properties, including nonlinear reactions, partially observed state, and nonstationary dynamics. Given limited process observations, we derive a posterior distribution quantifying model uncertainty, which can facilitate mechanism learning and support robust process control. To avoid evaluating the intractable likelihood, Approximate Bayesian Computation sampling with Sequential Monte Carlo (ABC-SMC) is developed to approximate the posterior distribution. Given high stochasticity and model uncertainty, it is computationally expensive to match process output trajectories. Therefore, we propose a linear Gaussian dynamic Bayesian network (LG-DBN) auxiliary likelihood-based ABC-SMC algorithm. By matching observed and simulated summary statistics, the proposed approach can dramatically reduce the computation cost and accelerate the posterior approximation convergence.
    SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling. (arXiv:2005.06377v3 [cs.CL] UPDATED)
    Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity, which captures neither semantics nor linguistic quality well, and require a reference summary that is costly to obtain. Recently, there have been a growing number of efforts to alleviate either or both of these two drawbacks. In this paper, we present a proof-of-concept study of a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries. In cross-domain tests, our strategy outperforms baselines with promising improvements and shows a great advantage in gauging linguistic qualities over all metrics.
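    A hedged sketch of the negative-sampling step: pair each document with progressively corrupted versions of its reference summary to create graded training pairs. The word-dropping corruption and the linear label scale are illustrative assumptions; the paper may use other corruption schemes.

```python
# Hedged sketch of building weakly supervised (document, summary, score) pairs.
import random

random.seed(0)

def corrupt(summary, drop_prob):
    words = summary.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else words[0]

doc = "the federal bank raised rates amid inflation concerns ..."
ref = "the bank raised rates to fight inflation"
pairs = [(doc, ref, 1.0)]                      # intact reference scores highest
for p in (0.2, 0.5, 0.8):                      # heavier corruption -> lower label
    pairs.append((doc, corrupt(ref, p), 1.0 - p))
for d, s, score in pairs:
    print(round(score, 1), "|", s)
# a regression model trained on such pairs can then score summaries reference-free
```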
    Characterizing Intersectional Group Fairness with Worst-Case Comparisons. (arXiv:2101.01673v5 [cs.LG] UPDATED)
    Machine Learning or Artificial Intelligence algorithms have gained considerable scrutiny in recent times owing to their propensity towards imitating and amplifying existing prejudices in society. This has led to a niche but growing body of work that identifies and attempts to fix these biases. A first step towards making these algorithms more fair is designing metrics that measure unfairness. Most existing work in this field deals with either a binary view of fairness (protected vs. unprotected groups) or politically defined categories (race or gender). Such categorization misses the important nuance of intersectionality - biases can often be amplified in subgroups that combine membership from different categories, especially if such a subgroup is particularly underrepresented in historical platforms of opportunity. In this paper, we discuss why fairness metrics need to be looked at under the lens of intersectionality, identify existing work in intersectional fairness, suggest a simple worst case comparison method to expand the definitions of existing group fairness metrics to incorporate intersectionality, and finally conclude with the social, legal and political framework to handle intersectional fairness in the modern context.
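    A minimal sketch of the worst-case comparison idea on synthetic data: compute a group metric for every intersectional subgroup and report the worst ratio. The positive-prediction-rate metric and the synthetic bias are illustrative assumptions.

```python
# Hedged sketch of a worst-case intersectional fairness comparison.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
race = rng.integers(0, 3, 1000)
gender = rng.integers(0, 2, 1000)
yhat = rng.random(1000) < 0.3 + 0.05 * race + 0.1 * gender  # biased predictions

rates = {}
for r, g in product(range(3), range(2)):       # every intersectional subgroup
    mask = (race == r) & (gender == g)
    rates[(r, g)] = yhat[mask].mean()          # positive-prediction rate

worst_ratio = min(rates.values()) / max(rates.values())
print(rates, round(worst_ratio, 2))            # 1.0 would be perfectly fair
```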
    Physics-Informed Deep Reversible Regression Model for Temperature Field Reconstruction of Heat-Source Systems. (arXiv:2106.11929v4 [cs.LG] UPDATED)
    Temperature monitoring during the lifetime of heat-source components in engineering systems is essential to guarantee their normal operation and working life. However, prior methods, which mainly use interpolation-based estimation to reconstruct the temperature field from limited monitoring points, require large amounts of temperature tensors for an accurate estimation. This may decrease the availability and reliability of the system and sharply increase the monitoring cost. To solve this problem, this work develops a novel physics-informed deep reversible regression model for temperature field reconstruction of heat-source systems (TFR-HSS), which can better reconstruct the temperature field with limited monitoring points, unsupervisedly. First, we define the TFR-HSS task mathematically and model it numerically, transforming the task into an image-to-image regression problem. Then this work develops the deep reversible regression model, which can better learn the physical information, especially over the boundary. Finally, considering the physical characteristics of heat conduction as well as the boundary conditions, this work proposes a physics-informed reconstruction loss comprising four training losses and jointly learns the deep surrogate model with these losses unsupervisedly. Experimental studies have been conducted on typical two-dimensional heat-source systems to demonstrate the effectiveness of the proposed method.
    Dropout Strikes Back: Improved Uncertainty Estimation via Diversity Sampling. (arXiv:2003.03274v3 [cs.LG] UPDATED)
    Uncertainty estimation for machine learning models is of high importance in many scenarios such as constructing the confidence intervals for model predictions and detection of out-of-distribution or adversarially generated points. In this work, we show that modifying the sampling distributions for dropout layers in neural networks improves the quality of uncertainty estimation. Our approach consists of two main steps: computing data-driven correlations between neurons and generating samples that include maximally diverse neurons. In a series of experiments on simulated and real-world data, we demonstrate that the diversification via determinantal point process-based sampling achieves state-of-the-art results in uncertainty estimation for regression and classification tasks. An important feature of our approach is that it does not require any modification to the models or training procedures, allowing straightforward application to any deep learning model with dropout layers.
    Development of Interpretable Machine Learning Models to Detect Arrhythmia based on ECG Data. (arXiv:2205.02803v1 [cs.LG])
    The analysis of electrocardiogram (ECG) signals can be time-consuming as it is performed manually by cardiologists. Therefore, automation through machine learning (ML) classification is being increasingly proposed, which would allow ML models to learn the features of a heartbeat and detect abnormalities. The lack of interpretability hinders the application of Deep Learning in healthcare. Through interpretability of these models, we would understand how a machine learning algorithm makes its decisions and what patterns are being followed for classification. This thesis builds Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) classifiers based on state-of-the-art models and compares their performance and interpretability to shallow classifiers. Here, both global and local interpretability methods are exploited to understand the interaction between dependent and independent variables across the entire dataset and to examine model decisions in each sample, respectively. Partial Dependence Plots, Shapley Additive Explanations, Permutation Feature Importance, and Gradient-Weighted Class Activation Maps (Grad-CAM) are the four interpretability techniques implemented on time-series ML models classifying ECG rhythms. In particular, we exploit Grad-CAM, which is a local interpretability technique, and examine whether its interpretability varies between correctly and incorrectly classified ECG beats within each class. Furthermore, the classifiers are evaluated using K-Fold cross-validation and Leave Groups Out techniques, and we use non-parametric statistical testing to examine whether differences are significant. It was found that Grad-CAM was the most effective interpretability technique at explaining predictions of the proposed CNN and LSTM models. We concluded that all high-performing classifiers looked at the QRS complex of the ECG rhythm when making predictions.
    Optimising Equal Opportunity Fairness in Model Training. (arXiv:2205.02393v1 [cs.LG])
    Real-world datasets often encode stereotypes and societal biases. Such biases can be implicitly captured by trained models, leading to biased predictions and exacerbating existing societal preconceptions. Existing debiasing methods, such as adversarial training and removing protected information from representations, have been shown to reduce bias. However, a disconnect between fairness criteria and training objectives makes it difficult to reason theoretically about the effectiveness of different techniques. In this work, we propose two novel training objectives which directly optimise for the widely-used criterion of equal opportunity, and show that they are effective in reducing bias while maintaining high performance over two classification tasks.
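    One plausible way to optimise for equal opportunity directly is to add a differentiable penalty on the gap in mean score between protected groups among positive-labelled examples; the sketch below is a generic surrogate of this kind, not necessarily either of the paper's two objectives.

```python
# Hedged sketch of a differentiable equal-opportunity penalty.
import torch
import torch.nn.functional as F

def eo_penalty(scores, labels, groups):
    pos = labels == 1                  # equal opportunity conditions on y = 1
    gap = scores[pos & (groups == 0)].mean() - scores[pos & (groups == 1)].mean()
    return gap.abs()                   # batches should contain both groups

logits = torch.randn(256, requires_grad=True)
scores = torch.sigmoid(logits)
labels = torch.randint(0, 2, (256,))
groups = torch.randint(0, 2, (256,))
loss = F.binary_cross_entropy(scores, labels.float()) \
       + 1.0 * eo_penalty(scores, labels, groups)     # accuracy + fairness
loss.backward()                        # both terms shape the model's gradients
```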
    Meta-learning Feature Representations for Adaptive Gaussian Processes via Implicit Differentiation. (arXiv:2205.02708v1 [cs.LG])
    We propose Adaptive Deep Kernel Fitting (ADKF), a general framework for learning deep kernels by interpolating between meta-learning and conventional learning. Our approach employs a bilevel optimization objective where we meta-learn feature representations that are generally useful across tasks, in the sense that task-specific Gaussian process models estimated on top of such features achieve the lowest possible predictive loss on average across tasks. We solve the resulting nested optimization problem using the implicit function theorem. We show that ADKF contains Deep Kernel Learning and Deep Kernel Transfer as special cases. Although ADKF is a completely general method, we argue that it is especially well-suited for drug discovery problems and demonstrate that it significantly outperforms previous state-of-the-art methods on a variety of real-world few-shot molecular property prediction tasks and out-of-domain molecular optimization tasks.
    StyleAlign: Analysis and Applications of Aligned StyleGAN Models. (arXiv:2110.11323v2 [cs.CV] UPDATED)
    In this paper, we perform an in-depth study of the properties and applications of aligned generative models. We refer to two models as aligned if they share the same architecture, and one of them (the child) is obtained from the other (the parent) via fine-tuning to another domain, a common practice in transfer learning. Several works already utilize some basic properties of aligned StyleGAN models to perform image-to-image translation. Here, we perform the first detailed exploration of model alignment, also focusing on StyleGAN. First, we empirically analyze aligned models and provide answers to important questions regarding their nature. In particular, we find that the child model's latent spaces are semantically aligned with those of the parent, inheriting incredibly rich semantics, even for distant data domains such as human faces and churches. Second, equipped with this better understanding, we leverage aligned models to solve a diverse set of tasks. In addition to image translation, we demonstrate fully automatic cross-domain image morphing. We further show that zero-shot vision tasks may be performed in the child domain, while relying exclusively on supervision in the parent domain. We demonstrate qualitatively and quantitatively that our approach yields state-of-the-art results, while requiring only simple fine-tuning and inversion.
    Can collaborative learning be private, robust and scalable?. (arXiv:2205.02652v1 [cs.LG])
    We investigate the effectiveness of combining differential privacy, model compression and adversarial training to improve the robustness of models against adversarial samples in train- and inference-time attacks. We explore the applications of these techniques as well as their combinations to determine which method performs best, without a significant utility trade-off. Our investigation provides a practical overview of various methods that allow one to achieve a competitive model performance, a significant reduction in model size and an improved empirical adversarial robustness without a severe performance degradation.
    Rapid Locomotion via Reinforcement Learning. (arXiv:2205.02824v1 [cs.RO])
    Agile maneuvers such as sprinting and high-speed turning in the wild are challenging for legged robots. We present an end-to-end learned controller that achieves record agility for the MIT Mini Cheetah, sustaining speeds up to 3.9 m/s. This system runs and turns fast on natural terrains like grass, ice, and gravel and responds robustly to disturbances. Our controller is a neural network trained in simulation via reinforcement learning and transferred to the real world. The two key components are (i) an adaptive curriculum on velocity commands and (ii) an online system identification strategy for sim-to-real transfer leveraged from prior work. Videos of the robot's behaviors are available at: https://agility.csail.mit.edu/
    Multi-fold Correlation Attention Network for Predicting Traffic Speeds with Heterogeneous Frequency. (arXiv:2104.09083v2 [cs.LG] UPDATED)
    Substantial efforts have been devoted to the investigation of spatiotemporal correlations for improving traffic speed prediction accuracy. However, existing works typically model the correlations based solely on the observed traffic state (e.g. traffic speed) without due consideration that different correlation measurements of the traffic data could exhibit a diverse set of patterns under different traffic situations. In addition, the existing works assume that all road segments can employ the same sampling frequency of traffic states, which is impractical. In this paper, we propose new measurements to model the spatial correlations among traffic data and show that the resulting correlation patterns vary significantly under various traffic situations. We propose a Heterogeneous Spatial Correlation (HSC) model to capture the spatial correlation based on a specific measurement, where the traffic data of varying road segments can be heterogeneous (i.e. obtained with different sampling frequency). We propose a Multi-fold Correlation Attention Network (MCAN), which relies on the HSC model to explore multi-fold spatial correlations and leverage LSTM networks to capture multi-fold temporal correlations to provide discriminating features in order to achieve accurate traffic prediction. The learned multi-fold spatiotemporal correlations together with contextual factors are fused with attention mechanism to make the final predictions. Experiments on real-world datasets demonstrate that the proposed MCAN model outperforms the state-of-the-art baselines.
    On the Dual Formulation of Boosting Algorithms. (arXiv:0901.3590v6 [cs.LG] UPDATED)
    We study boosting algorithms from a new perspective. We show that the Lagrange dual problems of AdaBoost, LogitBoost and soft-margin LPBoost with generalized hinge loss are all entropy maximization problems. By looking at the dual problems of these boosting algorithms, we show that the success of boosting algorithms can be understood in terms of maintaining a better margin distribution by maximizing margins and at the same time controlling the margin variance. We also theoretically prove that, approximately, AdaBoost maximizes the average margin, instead of the minimum margin. The duality formulation also enables us to develop column-generation-based optimization algorithms, which are totally corrective. We show that they exhibit almost identical classification results to those of standard stage-wise additive boosting algorithms, but with much faster convergence rates. Therefore fewer weak classifiers are needed to build the ensemble using our proposed optimization technique.
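    The average-vs-minimum-margin claim is easy to probe empirically. The sketch below is a hedged illustration rather than the paper's experiments: a tiny stump-based AdaBoost on synthetic data, reporting the normalized average and minimum margins (dataset, stump grid, and round count are all illustrative choices).

        # Plain AdaBoost with decision stumps; margins are normalized by the
        # total stump weight so that y * F(x) / sum(alpha) lies in [-1, 1].
        import numpy as np

        rng = np.random.default_rng(0)
        n = 400
        X = rng.normal(size=(n, 5))
        y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n))

        def best_stump(X, y, w):
            # exhaustive search over (feature, threshold, polarity) stumps
            best = (None, None, None, np.inf)
            for j in range(X.shape[1]):
                for thr in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
                    for pol in (1, -1):
                        pred = pol * np.sign(X[:, j] - thr)
                        err = w @ (pred != y)
                        if err < best[3]:
                            best = (j, thr, pol, err)
            return best

        w = np.full(n, 1.0 / n)
        F = np.zeros(n)                        # ensemble score
        alphas = []
        for _ in range(100):
            j, thr, pol, err = best_stump(X, y, w)
            err = min(max(err, 1e-12), 1 - 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)
            pred = pol * np.sign(X[:, j] - thr)
            F += alpha * pred
            alphas.append(alpha)
            w *= np.exp(-alpha * y * pred)     # standard AdaBoost reweighting
            w /= w.sum()

        margins = y * F / np.sum(alphas)       # normalized margins in [-1, 1]
        print("average margin:", margins.mean(), " minimum margin:", margins.min())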
    Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics. (arXiv:2205.02835v1 [cs.RO])
    Differentiable physics has recently been shown as a powerful tool for solving soft-body manipulation tasks. However, the differentiable physics solver often gets stuck when the initial contact points of the end effectors are sub-optimal or when performing multi-stage tasks that require contact point switching, which often leads to local minima. To address this challenge, we propose a contact point discovery approach (CPDeform) that guides the stand-alone differentiable physics solver to deform various soft-body plasticines. The key idea of our approach is to integrate optimal transport-based contact points discovery into the differentiable physics solver to overcome the local minima from initial contact points or contact switching. On single-stage tasks, our method can automatically find suitable initial contact points based on transport priorities. On complex multi-stage tasks, we can iteratively switch the contact points of end-effectors based on transport priorities. To evaluate the effectiveness of our method, we introduce PlasticineLab-M that extends the existing differentiable physics benchmark PlasticineLab to seven new challenging multi-stage soft-body manipulation tasks. Extensive experimental results suggest that: 1) on multi-stage tasks that are infeasible for the vanilla differentiable physics solver, our approach discovers contact points that efficiently guide the solver to completion; 2) on tasks where the vanilla solver performs sub-optimally or near-optimally, our contact point discovery method performs better than or on par with the manipulation performance obtained with handcrafted contact points.
    Time Shifts to Reduce the Size of Reservoir Computers. (arXiv:2205.02267v1 [cs.NE])
    A reservoir computer is a type of dynamical system arranged to do computation. Typically, a reservoir computer is constructed by connecting a large number of nonlinear nodes in a network that includes recurrent connections. In order to achieve accurate results, the reservoir usually contains hundreds to thousands of nodes. This high dimensionality makes it difficult to analyze the reservoir computer using tools from dynamical systems theory. Additionally, the need to create and connect large numbers of nonlinear nodes makes it difficult to design and build analog reservoir computers that can be faster and consume less power than digital reservoir computers. We demonstrate here that a reservoir computer may be divided into two parts: a small set of nonlinear nodes (the reservoir), and a separate set of time-shifted reservoir output signals. The time-shifted output signals serve to increase the rank and memory of the reservoir computer, and the set of nonlinear nodes may create an embedding of the input dynamical system. We use this time-shifting technique to obtain excellent performance from an opto-electronic delay-based reservoir computer with only a small number of virtual nodes. Because only a few nonlinear nodes are required, construction of a reservoir computer becomes much easier, and delay-based reservoir computers can operate at much higher speeds.
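    A minimal numpy sketch of the time-shift idea follows: a small echo-state reservoir whose linear readout is fit on both the instantaneous node states and time-shifted copies of them. The node count, shift values, and sine-prediction task are illustrative choices, not the paper's opto-electronic setup.

        import numpy as np

        rng = np.random.default_rng(0)
        T, n_nodes, shifts = 2000, 20, [1, 3, 7]       # few nodes + a few shifts
        u = np.sin(0.1 * np.arange(T))                 # input signal
        y_target = np.roll(u, -5)                      # toy task: predict 5 steps ahead

        W = rng.normal(0, 1, (n_nodes, n_nodes))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
        w_in = rng.normal(0, 0.5, n_nodes)

        x = np.zeros(n_nodes)
        states = np.zeros((T, n_nodes))
        for t in range(T):
            x = np.tanh(W @ x + w_in * u[t])
            states[t] = x

        # Augment the readout with time-shifted copies of the node states.
        cols = [states] + [np.roll(states, s, axis=0) for s in shifts]
        X = np.hstack(cols)[max(shifts):-5]            # drop warm-up / wrap-around rows
        y = y_target[max(shifts):-5]

        ridge = 1e-6                                   # ridge-regression readout
        W_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
        print("train NMSE:", np.mean((X @ W_out - y) ** 2) / np.var(y))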
    General sum stochastic games with networked information flows. (arXiv:2205.02760v1 [cs.LG])
    Inspired by applications such as supply chain management, epidemics, and social networks, we formulate a stochastic game model that addresses three key features common across these domains: 1) network-structured player interactions, 2) pair-wise mixed cooperation and competition among players, and 3) limited global information toward individual decision-making. In combination, these features pose significant challenges for black box approaches taken by deep learning-based multi-agent reinforcement learning (MARL) algorithms and deserve more detailed analysis. We formulate a networked stochastic game with pair-wise general sum objectives and an asymmetrical information structure, and empirically explore the effects of information availability on the outcomes of different MARL paradigms, such as individual learning and centralized learning with decentralized execution.
    Identifying Cause-and-Effect Relationships of Manufacturing Errors using Sequence-to-Sequence Learning. (arXiv:2205.02827v1 [cs.LG])
    In car-body production, the pre-formed sheet metal parts of the body are assembled on fully-automated production lines. The body passes through multiple stations in succession, and is processed according to the order requirements. The timely completion of orders depends on the individual station-based operations concluding within their scheduled cycle times. If an error occurs in one station, it can have a knock-on effect, resulting in delays on the downstream stations. To the best of our knowledge, there exist no methods for automatically distinguishing between source and knock-on errors in this setting, or for establishing a causal relation between them. Utilizing real-time information about conditions collected by a production data acquisition system, we propose a novel vehicle manufacturing analysis system, which uses deep learning to establish a link between source and knock-on errors. We benchmark three sequence-to-sequence models, and introduce a novel composite time-weighted action metric for evaluating models in this context. We evaluate our framework on a real-world car production dataset recorded by Volkswagen Commercial Vehicles. Surprisingly, we find that 71.68% of sequences contain either a source or knock-on error. With respect to seq2seq model training, we find that the Transformer demonstrates better performance than LSTM and GRU in this domain, in particular when the prediction range with respect to the durations of future actions is increased.  ( 2 min )
    Maximum n-times Coverage for Vaccine Design. (arXiv:2101.10902v5 [q-bio.QM] UPDATED)
    We introduce the maximum $n$-times coverage problem that selects $k$ overlays to maximize the summed coverage of weighted elements, where each element must be covered at least $n$ times. We also define the min-cost $n$-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least $n$ times is at least $\tau$. Maximum $n$-times coverage is a generalization of the multi-set multi-cover problem, is NP-complete, and is not submodular. We introduce two new practical solutions for $n$-times coverage based on integer linear programming and sequential greedy optimization. We show that maximum $n$-times coverage is a natural way to frame peptide vaccine design, and find that it produces a pan-strain COVID-19 vaccine design that is superior to 29 other published designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.
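    The sequential greedy heuristic mentioned above can be sketched in a few lines; since the objective is not submodular, greedy carries no approximation guarantee here. The overlays, weights, n and k below are toy values.

        # Greedy selection for maximum n-times coverage: repeatedly add the
        # overlay with the largest marginal gain in the weight of elements
        # covered at least n times.
        def greedy_n_times_coverage(overlays, weights, n, k):
            """overlays: list of sets of element ids; weights: dict id -> weight."""
            counts = {e: 0 for e in weights}
            chosen = []

            def objective(cnt):
                return sum(w for e, w in weights.items() if cnt[e] >= n)

            for _ in range(k):
                best, best_gain = None, float("-inf")
                for i, ov in enumerate(overlays):
                    if i in chosen:
                        continue
                    trial = dict(counts)
                    for e in ov:
                        trial[e] += 1
                    gain = objective(trial) - objective(counts)
                    if gain > best_gain:
                        best, best_gain = i, gain
                chosen.append(best)
                for e in overlays[best]:
                    counts[e] += 1
            return chosen

        overlays = [{0, 1}, {1, 2}, {0, 2}, {2, 3}]
        weights = {0: 1.0, 1: 2.0, 2: 1.5, 3: 0.5}
        print(greedy_n_times_coverage(overlays, weights, n=2, k=3))  # [0, 1, 2]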
    Sound Event Classification in an Industrial Environment: Pipe Leakage Detection Use Case. (arXiv:2205.02706v1 [cs.LG])
    In this work, a multi-stage Machine Learning (ML) pipeline is proposed for pipe leakage detection in an industrial environment. As opposed to other industrial and urban environments, the environment under study includes many interfering background noises, complicating the identification of leaks. Furthermore, the harsh environmental conditions limit the amount of data collected and impose the use of low-complexity algorithms. To address the environment's constraints, the developed ML pipeline applies multiple steps, each addressing the environment's challenges. The proposed ML pipeline first reduces the data dimensionality by feature selection techniques and then incorporates time correlations by extracting time-based features. The resultant features are fed to a low-complexity Support Vector Machine (SVM) that generalizes well to a small amount of data. An extensive experimental procedure was carried out on two datasets, one with background industrial noise and one without, to evaluate the validity of the proposed pipeline. The SVM hyper-parameters and parameters specific to the pipeline steps were tuned as part of the experimental procedure. The best models obtained from the dataset with industrial noise and leaks were applied to datasets without noise and with and without leaks to test their generalizability. The results show that the model produces excellent results, with 99% accuracy and F1-scores of 0.93 and 0.9 on the respective datasets.
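    A condensed scikit-learn sketch of the multi-stage idea (feature selection followed by a low-complexity SVM) is shown below; the selector, kernel, and synthetic data are illustrative stand-ins for the paper's acoustic features and tuned pipeline.

        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score
        from sklearn.datasets import make_classification

        # Synthetic stand-in for extracted acoustic / time-based features.
        X, y = make_classification(n_samples=300, n_features=60,
                                   n_informative=8, random_state=0)
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("select", SelectKBest(f_classif, k=10)),   # dimensionality reduction
            ("svm", SVC(kernel="rbf", C=1.0)),          # low-complexity classifier
        ])
        print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())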
    Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction. (arXiv:2205.02834v1 [cs.CV])
    This paper studies the problem of fixing malfunctional 3D objects. While previous works focus on building passive perception models to learn the functionality from static 3D objects, we argue that functionality is reckoned with respect to the physical interactions between the object and the user. Given a malfunctional object, humans can perform mental simulations to reason about its functionality and figure out how to fix it. Inspired by this, we propose FixIt, a dataset that contains about 5k poorly-designed 3D physical objects paired with choices to fix them. To mimic humans' mental simulation process, we present FixNet, a novel framework that seamlessly incorporates perception and physical dynamics. Specifically, FixNet consists of a perception module to extract the structured representation from the 3D point cloud, a physical dynamics prediction module to simulate the results of interactions on 3D objects, and a functionality prediction module to evaluate the functionality and choose the correct fix. Experimental results show that our framework outperforms baseline models by a large margin, and can generalize well to objects with similar interaction types.  ( 2 min )
    pyRDF2Vec: A Python Implementation and Extension of RDF2Vec. (arXiv:2205.02283v1 [cs.LG])
    This paper introduces pyRDF2Vec, a Python software package that reimplements the well-known RDF2Vec algorithm along with several of its extensions. By making the algorithm available in the most popular data science language, and by bundling all extensions into a single place, the use of RDF2Vec is simplified for data scientists. The package is released under an MIT license and structured in such a way as to foster further research into sampling, walking, and embedding strategies, which are vital components of the RDF2Vec algorithm. Several optimisations have been implemented in pyRDF2Vec that allow for more efficient walk extraction than the original algorithm. Furthermore, best practices in terms of code styling, testing, and documentation were applied such that the package is future-proof as well as to facilitate external contributions.  ( 2 min )
    Multi-Agent Deep Reinforcement Learning in Vehicular OCC. (arXiv:2205.02672v1 [cs.LG])
    Optical camera communications (OCC) has emerged as a key enabling technology for the seamless operation of future autonomous vehicles. In this paper, we introduce a spectral efficiency optimization approach in vehicular OCC. Specifically, we aim at optimally adapting the modulation order and the relative speed while respecting bit error rate and latency constraints. As the optimization problem is NP-hard, we model it as a Markov decision process (MDP) to enable the use of solutions that can be applied online. We then relax the constrained problem using a Lagrange relaxation approach before solving it by multi-agent deep reinforcement learning (DRL). We verify the performance of our proposed scheme through extensive simulations and compare it with several variants of our approach and a random method. The evaluation shows that our system achieves significantly higher sum spectral efficiency compared to the schemes under comparison.  ( 2 min )
    Finding Bipartite Components in Hypergraphs. (arXiv:2205.02771v1 [cs.DS])
    Hypergraphs are important objects to model ternary or higher-order relations of objects, and have a number of applications in analysing many complex datasets occurring in practice. In this work we study a new heat diffusion process in hypergraphs, and employ this process to design a polynomial-time algorithm that approximately finds bipartite components in a hypergraph. We theoretically prove the performance of our proposed algorithm, and compare it against the previous state-of-the-art through extensive experimental analysis on both synthetic and real-world datasets. We find that our new algorithm consistently and significantly outperforms the previous state-of-the-art across a wide range of hypergraphs.  ( 2 min )
    KnitCity: a machine learning-based, game-theoretical framework for prediction assessment and seismic risk policy design. (arXiv:2205.02679v1 [cs.LG])
    Knitted fabric exhibits avalanche-like events when deformed: by analogy with earthquakes, we are interested in predicting these "knitquakes". However, as in most analogous seismic models, the peculiar statistics of the corresponding time-series severely jeopardize this endeavour, due to the time intermittence and scale-invariance of these events. But more importantly, such predictions are hard to assess: depending on the choice of what to predict, the results can be very different and not easily compared. Furthermore, forecasting models may be trained with various generic metrics which ignore some important specificities of the problem at hand, in our case seismic risk. Finally, these models often do not provide a clear strategy regarding the best way to use these predictions in practice. Here we introduce a framework that allows one to design, evaluate and compare not only predictors but also decision-making policies: a model seismically active city subjected to the crackling dynamics observed in the mechanical response of knitted fabric. We thus proceed to study the population of KnitCity, introducing a policy through which the mayor of the town can decide to either keep people in, which in the case of large events causes human losses, or evacuate the city, which costs a daily fee. The policy relies only on past seismic observations. We construct efficient policies using a reinforcement learning environment and various time-series predictors based on artificial neural networks. By inducing a physically motivated metric on the predictors, this mechanism allows quantitative assessment and comparison of their relevance in the decision-making process.  ( 2 min )
    Morphological Wobbling Can Help Robots Learn. (arXiv:2205.02811v1 [cs.LG])
    We propose to make the physical characteristics of a robot oscillate while it learns to improve its behavioral performance. We consider quantities such as mass, actuator strength, and size that are usually fixed in a robot, and show that when those quantities oscillate at the beginning of the learning process on a simulated 2D soft robot, the performance on a locomotion task can be significantly improved. We investigate the dynamics of the phenomenon and conclude that in our case, surprisingly, a high-frequency oscillation with a large amplitude for a large portion of the learning duration leads to the highest performance benefits. Furthermore, we show that morphological wobbling significantly increases exploration of the search space.  ( 2 min )
    Quantum Extremal Learning. (arXiv:2205.02807v1 [quant-ph])
    We propose a quantum algorithm for 'extremal learning', which is the process of finding the input to a hidden function that extremizes the function output, without having direct access to the hidden function, given only partial input-output (training) data. The algorithm, called quantum extremal learning (QEL), consists of a parametric quantum circuit that is variationally trained to model data input-output relationships and where a trainable quantum feature map, which encodes the input data, is analytically differentiated in order to find the coordinate that extremizes the model. This enables the combination of established quantum machine learning modelling with established quantum optimization, on a single circuit/quantum computer. We have tested our algorithm on a range of classical datasets based on either discrete or continuous input variables, both of which are compatible with the algorithm. In the case of discrete variables, we test our algorithm on synthetic problems formulated with Max-Cut problem generators, also considering higher-order correlations in the input-output relationships. In the case of continuous variables, we test our algorithm on synthetic datasets in 1D and on simple ordinary differential equations. We find that the algorithm is able to successfully find the extremal value of such problems, even when the training dataset is sparse or covers only a small fraction of the input configuration space. We additionally show how the algorithm can be used for much more general cases of higher dimensionality, complex differential equations, and with full flexibility in the choice of both modeling and optimization ansatz. We envision that due to its general framework and simple construction, the QEL algorithm will be able to solve a wide variety of applications in different fields, opening up areas of further research.  ( 2 min )
    Unsupervised Mismatch Localization in Cross-Modal Sequential Data. (arXiv:2205.02670v1 [cs.LG])
    Content mismatch usually occurs when data from one modality is translated to another, e.g. language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume the content involved in the two modalities is perfectly matched, and thus have difficulty locating such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, named mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a model is very challenging due to the discrete latent variables with complex dependencies involved. We propose a novel and effective training procedure which estimates the hard assignments of the discrete latent variables over a specifically designed lattice and updates the parameters of the neural networks alternately. Our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations for model training.  ( 2 min )
    Towards Fast Simulation of Environmental Fluid Mechanics with Multi-Scale Graph Neural Networks. (arXiv:2205.02637v1 [physics.flu-dyn])
    Numerical simulators are essential tools in the study of natural fluid-systems, but their performance often limits application in practice. Recent machine-learning approaches have demonstrated their ability to accelerate spatio-temporal predictions, although with only moderate accuracy in comparison. Here we introduce MultiScaleGNN, a novel multi-scale graph neural network model for learning to infer unsteady continuum mechanics in problems encompassing a range of length scales and complex boundary geometries. We demonstrate this method on advection problems and incompressible fluid dynamics, both fundamental phenomena in oceanic and atmospheric processes. Our results show good extrapolation to new domain geometries and parameters for long-term temporal simulations. Simulations obtained with MultiScaleGNN are between two and four orders of magnitude faster than those on which it was trained.  ( 2 min )
    When Fair Ranking Meets Uncertain Inference. (arXiv:2105.02091v2 [cs.IR] UPDATED)
    Existing fair ranking systems, especially those designed to be demographically fair, assume that accurate demographic information about individuals is available to the ranking algorithm. In practice, however, this assumption may not hold -- in real-world contexts like ranking job applicants or credit seekers, social and legal barriers may prevent algorithm operators from collecting peoples' demographic information. In these cases, algorithm operators may attempt to infer peoples' demographics and then supply these inferences as inputs to the ranking algorithm. In this study, we investigate how uncertainty and errors in demographic inference impact the fairness offered by fair ranking algorithms. Using simulations and three case studies with real datasets, we show how demographic inferences drawn from real systems can lead to unfair rankings. Our results suggest that developers should not use inferred demographic data as input to fair ranking algorithms, unless the inferences are extremely accurate.  ( 2 min )
    What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning. (arXiv:2205.02671v1 [cs.CV])
    Understanding spatial relations is essential for intelligent agents to act and communicate in the physical world. Relative directions are spatial relations that describe the relative positions of target objects with regard to the intrinsic orientation of reference objects. Grounding relative directions is more difficult than grounding absolute directions because it not only requires a model to detect objects in the image and to identify spatial relation based on this information, but it also needs to recognize the orientation of objects and integrate this information into the reasoning process. We investigate the challenging problem of grounding relative directions with end-to-end neural networks. To this end, we provide GRiD-3D, a novel dataset that features relative directions and complements existing visual question answering (VQA) datasets, such as CLEVR, that involve only absolute directions. We also provide baselines for the dataset with two established end-to-end VQA models. Experimental evaluations show that answering questions on relative directions is feasible when questions in the dataset simulate the necessary subtasks for grounding relative directions. We discover that those subtasks are learned in an order that reflects the steps of an intuitive pipeline for processing relative directions.  ( 2 min )
    Rethinking Classifier And Adversarial Attack. (arXiv:2205.02743v1 [cs.LG])
    Various defense models have been proposed to resist adversarial attack algorithms, but existing adversarial robustness evaluation methods always overestimate the adversarial robustness of these models (i.e. they do not approach the lower bound of robustness). To solve this problem, this paper first uses the Decouple Space method to divide the classifier into two parts: non-linear and linear. On this basis, this paper defines the representation vector of the original example (and its space, i.e., the representation space) and uses Absolute Classification Boundaries Initialization (ACBI) iterative optimization to obtain a better attack starting point (i.e. attacking from this point can approach the lower bound of robustness faster). In particular, we apply ACBI to nearly 50 widely-used defense models (covering 8 architectures). Experimental results show that ACBI achieves lower robust accuracy in all cases.  ( 2 min )
    A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022. (arXiv:2205.02752v1 [cs.LG])
    A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022. This index is incomplete as some authors of invited non-archival presentations opted not to include their papers in this index.
    Chemoreception and chemotaxis of a three-sphere swimmer. (arXiv:2205.02678v1 [cs.LG])
    The coupled problem of hydrodynamics and solute transport for the Najafi-Golestanian three-sphere swimmer is studied, with the Reynolds number set to zero and Péclet numbers (Pe) ranging from 0.06 to 60. The adopted method is the numerical simulation of the problem with a finite element code based upon the FEniCS library. For the swimmer executing the optimal locomotion gait, we report the Sherwood number as a function of Pe in homogeneous fluids and confirm that little gain in solute flux is achieved by swimming unless Pe is significantly larger than 10. We also consider the swimmer as a learning agent moving inside a fluid that has a concentration gradient. The outcomes of Q-learning processes show that learning locomotion (with the displacement as reward) is significantly easier than learning chemotaxis (with the increase of solute flux as reward). The chemotaxis problem, even at low Pe, has a varying environment that renders learning more difficult. Further, the learning difficulty increases severely with the Péclet number. The results demonstrate the challenges that natural and artificial swimmers need to overcome to migrate efficiently when exposed to chemical inhomogeneities.  ( 2 min )
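    For readers unfamiliar with the learning component, such agents use the standard tabular Q-learning update sketched below; the state/action discretization and the random stand-in environment are assumptions, not the swimmer simulation (where the reward would be displacement or the increase in solute flux).

        import numpy as np

        n_states, n_actions = 16, 4          # e.g. discretized gaits / strokes
        Q = np.zeros((n_states, n_actions))
        alpha, gamma, eps = 0.1, 0.95, 0.1   # step size, discount, exploration
        rng = np.random.default_rng(1)

        def step(s, a):                      # stand-in environment dynamics
            return rng.integers(n_states), rng.normal()

        s = 0
        for _ in range(10_000):
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r = step(s, a)
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next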
    Holistic Approach to Measure Sample-level Adversarial Vulnerability and its Utility in Building Trustworthy Systems. (arXiv:2205.02604v1 [cs.CV])
    An adversarial attack perturbs an image with an imperceptible noise, leading to incorrect model prediction. Recently, a few works showed an inherent bias associated with such attacks (robustness bias), where certain subgroups in a dataset (e.g. based on class, gender, etc.) are less robust than others. This bias not only persists even after adversarial training, but often results in severe performance discrepancies across these subgroups. Existing works characterize a subgroup's robustness bias by only checking an individual sample's proximity to the decision boundary. In this work, we argue that this measure alone is not sufficient and validate our argument via extensive experimental analysis. It has been observed that adversarial attacks often corrupt the high-frequency components of the input image. We, therefore, propose a holistic approach for quantifying the adversarial vulnerability of a sample by combining these different perspectives, i.e., the degree of the model's reliance on high-frequency features and the (conventional) sample-distance to the decision boundary. We demonstrate that by reliably estimating adversarial vulnerability at the sample level using the proposed holistic metric, it is possible to develop a trustworthy system where humans can be alerted about incoming samples that are highly likely to be misclassified at test time. This is achieved with better precision when our holistic metric is used over individual measures. To further corroborate the utility of the proposed holistic approach, we perform knowledge distillation in a limited-sample setting. We observe that the student network trained with the subset of samples selected using our combined metric performs better than both the competing baselines, viz., where samples are selected randomly or based on their distances to the decision boundary.  ( 2 min )
    PyDaddy: A Python package for discovering stochastic dynamical equations from timeseries data. (arXiv:2205.02645v1 [q-bio.QM])
    Most real-world ecological dynamics, ranging from ecosystem dynamics to collective animal movement, are inherently stochastic in nature. Stochastic differential equations (SDEs) are a popular framework for modelling dynamics with intrinsic randomness. Here, we focus on the inverse question: if one has empirically measured time-series data from some system of interest, is it possible to discover the SDE model that best describes the data? We present PyDaddy (PYthon library for DAta Driven DYnamics), a toolbox to construct and analyze interpretable SDE models based on time-series data. We combine traditional approaches for data-driven SDE reconstruction with an equation learning approach, to derive symbolic equations governing the stochastic dynamics. The toolkit is presented as an open-source Python library, and consists of tools to construct and analyze SDEs. Functionality is included for visual examination of the stochastic structure of the data, guided extraction of the functional form of the SDE, and diagnosis and debugging of the underlying assumptions and the extracted model. Using simulated time-series datasets exhibiting a wide range of dynamics, we show that PyDaddy is able to correctly identify underlying SDE models. We demonstrate the applicability of the toolkit to real-world data using previously published movement data of a fish school. Starting from the time-series of the observed polarization of the school, PyDaddy readily discovers the SDE model governing the dynamics of group polarization. The model recovered by PyDaddy is consistent with the previous study. In summary, stochastic and noise-induced effects are central to the dynamics of many biological systems. In this context, we present an easy-to-use package to reconstruct SDEs from time-series data.  ( 2 min )
    Compressive Ptychography using Deep Image and Generative Priors. (arXiv:2205.02397v1 [cs.CV])
    Ptychography is a well-established coherent diffraction imaging technique that enables non-invasive imaging of samples at a nanometer scale. It has been extensively used in various areas such as the defense industry and materials science. One major limitation of ptychography is the long data acquisition time due to mechanical scanning of the sample; therefore, approaches to reduce the number of scan points are highly desired. However, reconstructions with fewer scan points lead to imaging artifacts and significant distortions, hindering a quantitative evaluation of the results. To address this bottleneck, we propose a generative model combining deep image priors with deep generative priors. The self-training approach optimizes the deep generative neural network to create a solution for a given dataset. We complement our approach with a prior acquired from a previously trained discriminator network to avoid a possible divergence from the desired output caused by the noise in the measurements. We also suggest using total variation as a complementary prior to combat artifacts due to measurement noise. We analyze our approach through numerical experiments with different probe overlap percentages and varying noise levels. We also demonstrate improved reconstruction accuracy compared to the state-of-the-art method, and discuss the advantages and disadvantages of our approach.  ( 2 min )
    PI-NLF: A Proportional-Integral Approach for Non-negative Latent Factor Analysis. (arXiv:2205.02591v1 [cs.LG])
    A high-dimensional and incomplete (HDI) matrix frequently appears in various big-data-related applications, which demonstrates the inherently non-negative interactions among numerous nodes. A non-negative latent factor (NLF) model performs efficient representation learning on an HDI matrix, whose learning process mostly relies on a single latent factor-dependent, non-negative and multiplicative update (SLF-NMU) algorithm. However, an SLF-NMU algorithm updates a latent factor based on the current update increment only, without appropriate consideration of past learning information, resulting in slow convergence. Inspired by the prominent success of a proportional-integral (PI) controller in various applications, this paper proposes a Proportional-Integral-incorporated Non-negative Latent Factor (PI-NLF) model with two-fold ideas: a) establishing an Increment Refinement (IR) mechanism that considers past update increments following the principle of a PI controller; and b) designing an IR-based SLF-NMU (ISN) algorithm to accelerate the convergence rate of the resultant model. Empirical studies on four HDI datasets demonstrate that a PI-NLF model outperforms the state-of-the-art models in both computational efficiency and estimation accuracy for missing data of an HDI matrix. Hence, this study unveils the feasibility of boosting the performance of a non-negative learning algorithm through an error feedback controller.  ( 2 min )
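    A loose numpy sketch of the increment-refinement idea is given below: the raw increment of a multiplicative update is refined with proportional and (leaky) integral terms before being applied, with non-negativity enforced by projection. The gains, the leaky integral, and the dense stand-in matrix are all illustrative assumptions; this is not the paper's exact ISN algorithm, which operates on the observed entries of an incomplete matrix.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.random((50, 40))                   # dense stand-in for an HDI matrix
        W, H = rng.random((50, 5)), rng.random((5, 40))
        Kp, Ki, beta = 1.0, 0.2, 0.5               # proportional / integral gains (assumed)
        I = np.zeros_like(H)                       # leaky integral of past increments

        for _ in range(200):
            H_raw = H * (W.T @ X) / (W.T @ W @ H + 1e-12)   # standard multiplicative step
            delta = H_raw - H                                # raw update increment
            I = beta * I + delta                             # accumulate past increments
            H = np.maximum(H + Kp * delta + Ki * I, 0.0)     # PI-refined, kept non-negative
            W *= (X @ H.T) / (W @ H @ H.T + 1e-12)           # plain multiplicative step for W

        print("relative fit error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))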
    Model-Based Deep Learning: On the Intersection of Deep Learning and Optimization. (arXiv:2205.02640v1 [eess.SP])
    Decision making algorithms are used in a multitude of different applications. Conventional approaches for designing decision algorithms employ principled and simplified modelling, based on which one can determine decisions via tractable optimization. More recently, deep learning approaches that use highly parametric architectures tuned from data without relying on mathematical models are becoming increasingly popular. Model-based optimization and data-centric deep learning are often considered to be distinct disciplines. Here, we characterize them as edges of a continuous spectrum varying in specificity and parameterization, and provide a tutorial-style presentation of the methodologies lying in the middle ground of this spectrum, referred to as model-based deep learning. We accompany our presentation with running examples in super-resolution and stochastic control, and show how they are expressed using the provided characterization and specialized in each of the detailed methodologies. The gains of combining model-based optimization and deep learning are demonstrated using experimental results in various applications, ranging from biomedical imaging to digital communications.  ( 2 min )
    Learning to Solve Vehicle Routing Problems: A Survey. (arXiv:2205.02453v1 [cs.LG])
    This paper provides a systematic overview of machine learning methods applied to solve NP-hard Vehicle Routing Problems (VRPs). Recently, there has been great interest from both the machine learning and operations research communities in solving VRPs either by pure learning methods or by combining them with traditional hand-crafted heuristics. We present a taxonomy of the studies in terms of learning paradigms, solution structures, underlying models, and algorithms. We present in detail the results of the state-of-the-art methods, demonstrating their competitiveness with traditional methods. The paper outlines future research directions for incorporating learning-based solutions to overcome the challenges of modern transportation systems.  ( 2 min )
    COGMEN: COntextualized GNN based Multimodal Emotion recognitioN. (arXiv:2205.02455v1 [cs.CL])
    Emotions are an inherent part of human interactions, and consequently, it is imperative to develop AI systems that understand and recognize human emotions. During a conversation involving various people, a person's emotions are influenced by the other speaker's utterances and their own emotional state over the utterances. In this paper, we propose COntextualized Graph Neural Network based Multimodal Emotion recognitioN (COGMEN) system that leverages local information (i.e., inter/intra dependency between speakers) and global information (context). The proposed model uses Graph Neural Network (GNN) based architecture to model the complex dependencies (local and global information) in a conversation. Our model gives state-of-the-art (SOTA) results on IEMOCAP and MOSEI datasets, and detailed ablation experiments show the importance of modeling information at both levels.  ( 2 min )
    Contrastive Multi-view Hyperbolic Hierarchical Clustering. (arXiv:2205.02618v1 [cs.LG])
    Hierarchical clustering recursively partitions data at an increasingly finer granularity. In real-world applications, multi-view data have become increasingly important. This raises a less investigated problem, i.e., multi-view hierarchical clustering, to better understand the hierarchical structure of multi-view data. To this end, we propose a novel neural network-based model, namely Contrastive Multi-view Hyperbolic Hierarchical Clustering (CMHHC). It consists of three components, i.e., multi-view alignment learning, aligned feature similarity learning, and continuous hyperbolic hierarchical clustering. First, we align sample-level representations across multiple views in a contrastive way to capture the view-invariance information. Next, we utilize both the manifold and Euclidean similarities to improve the metric property. Then, we embed the representations into a hyperbolic space and optimize the hyperbolic embeddings via a continuous relaxation of hierarchical clustering loss. Finally, a binary clustering tree is decoded from optimized hyperbolic embeddings. Experimental results on five real-world datasets demonstrate the effectiveness of the proposed method and its components.  ( 2 min )
    GANimator: Neural Motion Synthesis from a Single Sequence. (arXiv:2205.02625v1 [cs.GR])
    We present GANimator, a generative model that learns to synthesize novel motions from a single, short motion sequence. GANimator generates motions that resemble the core elements of the original motion, while simultaneously synthesizing novel and diverse movements. Existing data-driven techniques for motion synthesis require a large motion dataset which contains the desired and specific skeletal structure. By contrast, GANimator only requires training on a single motion sequence, enabling novel motion synthesis for a variety of skeletal structures, e.g., bipeds, quadrupeds, hexapeds, and more. Our framework contains a series of generative and adversarial neural networks, each responsible for generating motions in a specific frame rate. The framework progressively learns to synthesize motion from random noise, enabling hierarchical control over the generated motion content across varying levels of detail. We show a number of applications, including crowd simulation, key-frame editing, style transfer, and interactive control, which all learn from a single input sequence. Code and data for this paper are at https://peizhuoli.github.io/ganimator.  ( 2 min )
    On Disentangled and Locally Fair Representations. (arXiv:2205.02673v1 [cs.LG])
    We study the problem of performing classification in a manner that is fair for sensitive groups, such as race and gender. This problem is tackled through the lens of disentangled and locally fair representations. We learn a locally fair representation, such that, under the learned representation, the neighborhood of each sample is balanced in terms of the sensitive attribute. For instance, when a decision is made to hire an individual, we ensure that the $K$ most similar hired individuals are racially balanced. Crucially, we ensure that similar individuals are found based on attributes not correlated to their race. To this end, we disentangle the embedding space into two representations; the first is correlated with the sensitive attribute, while the second is not. We apply our local fairness objective only to the second, uncorrelated, representation. Through a set of experiments, we demonstrate the necessity of both disentangled and local fairness for obtaining fair and accurate representations. We evaluate our method on real-world settings such as predicting income and re-incarceration rate and demonstrate the advantage of our method.  ( 2 min )
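    The local fairness criterion can be audited directly: for each sample, check whether its K nearest neighbours in the uncorrelated representation are balanced with respect to the sensitive attribute. The sketch below uses synthetic data and illustrates the criterion only, not the paper's training objective.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)
        Z = rng.normal(size=(200, 8))        # stand-in for the uncorrelated embedding
        s = rng.integers(0, 2, size=200)     # binary sensitive attribute

        K = 10
        nn = NearestNeighbors(n_neighbors=K + 1).fit(Z)
        _, idx = nn.kneighbors(Z)            # first neighbour is the point itself
        local_rate = s[idx[:, 1:]].mean(axis=1)

        # Perfect local fairness would put every neighbourhood rate near the
        # base rate; report the worst deviation as a simple audit statistic.
        print("base rate:", s.mean(),
              " max deviation:", np.abs(local_rate - s.mean()).max())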
    Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. (arXiv:2205.02596v1 [cs.CL])
    We present a comprehensive work on automated veracity assessment from dataset creation to developing novel methods based on Natural Language Inference (NLI), focusing on misinformation related to the COVID-19 pandemic. We first describe the construction of the novel PANACEA dataset consisting of heterogeneous claims on COVID-19 and their respective information sources. The dataset construction includes work on retrieval techniques and similarity measurements to ensure a unique set of claims. We then propose novel techniques for automated veracity assessment based on Natural Language Inference including graph convolutional networks and attention based approaches. We have carried out experiments on evidence retrieval and veracity assessment on the dataset using the proposed techniques and found them competitive with SOTA methods, and provided a detailed discussion.  ( 2 min )
    FastRE: Towards Fast Relation Extraction with Convolutional Encoder and Improved Cascade Binary Tagging Framework. (arXiv:2205.02490v1 [cs.CL])
    Recent work on extracting relations from texts has achieved excellent performance. However, most existing methods pay less attention to efficiency, making it still challenging to quickly extract relations from massive or streaming text data in realistic scenarios. The main efficiency bottleneck is that these methods use a Transformer-based pre-trained language model for encoding, which heavily affects training and inference speed. To address this issue, we propose a fast relation extraction model (FastRE) based on a convolutional encoder and an improved cascade binary tagging framework. Compared to previous work, FastRE employs several innovations to improve efficiency while also keeping promising performance. Concretely, FastRE adopts a novel convolutional encoder architecture combined with dilated convolution, gated unit and residual connection, which significantly reduces the computation cost of training and inference, while maintaining satisfactory performance. Moreover, to improve the cascade binary tagging framework, FastRE first introduces a type-relation mapping mechanism to accelerate tagging efficiency and alleviate relation redundancy, and then utilizes a position-dependent adaptive thresholding strategy to obtain higher tagging accuracy and better model generalization. Experimental results demonstrate that FastRE is well balanced between efficiency and performance, achieving 3-10x faster training, 7-15x faster inference, and 1/100 of the parameters compared to the state-of-the-art models, while its performance remains competitive.  ( 2 min )
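    The encoder ingredients named above (dilated convolution, a gated unit, and a residual connection) compose naturally; a hedged PyTorch sketch with illustrative channel sizes and dilation rates, not FastRE's exact configuration, follows.

        import torch
        import torch.nn as nn

        class GatedDilatedConvBlock(nn.Module):
            def __init__(self, channels, dilation):
                super().__init__()
                pad = dilation  # keeps sequence length for kernel_size=3
                self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                                      padding=pad, dilation=dilation)

            def forward(self, x):            # x: (batch, channels, seq_len)
                h, g = self.conv(x).chunk(2, dim=1)
                return x + torch.tanh(h) * torch.sigmoid(g)   # gated unit + residual

        # Stacked blocks with growing dilation cover a wide receptive field
        # at a fraction of a Transformer encoder's cost.
        encoder = nn.Sequential(*[GatedDilatedConvBlock(128, d) for d in (1, 2, 4, 8)])
        tokens = torch.randn(4, 128, 50)     # embedded sentence of 50 tokens
        print(encoder(tokens).shape)         # torch.Size([4, 128, 50])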
    Optimal Algorithms for Mean Estimation under Local Differential Privacy. (arXiv:2205.02466v1 [cs.LG])
    We study the problem of mean estimation of $\ell_2$-bounded vectors under the constraint of local differential privacy. While the literature has a variety of algorithms that achieve the asymptotically optimal rates for this problem, the performance of these algorithms in practice can vary significantly due to varying (and often large) hidden constants. In this work, we investigate the question of designing the protocol with the smallest variance. We show that PrivUnit (Bhowmick et al. 2018) with optimized parameters achieves the optimal variance among a large family of locally private randomizers. To prove this result, we establish some properties of local randomizers, and use symmetrization arguments that allow us to write the optimal randomizer as the optimizer of a certain linear program. These structural results, which should extend to other problems, then allow us to show that the optimal randomizer belongs to the PrivUnit family. We also develop a new variant of PrivUnit based on the Gaussian distribution which is more amenable to mathematical analysis and enjoys the same optimality guarantees. This allows us to establish several useful properties on the exact constants of the optimal error as well as to numerically estimate these constants.  ( 2 min )
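    For context, a much simpler baseline than PrivUnit (each user privatizing an ℓ2-bounded vector with the Gaussian mechanism before the server averages) can be sketched as follows; epsilon and delta are illustrative, and the classic noise calibration used below assumes epsilon < 1.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, eps, delta = 5000, 20, 0.5, 1e-6
        X = rng.normal(size=(n, d))
        X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # clip to unit ball

        # L2 sensitivity between any two unit-ball inputs is 2; classic
        # Gaussian-mechanism calibration (valid for eps < 1).
        sigma = (2.0 / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
        reports = X + rng.normal(scale=sigma, size=X.shape)             # local noise

        est = reports.mean(axis=0)
        print("estimation error:", np.linalg.norm(est - X.mean(axis=0)))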
    dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training. (arXiv:2205.02473v1 [cs.DC])
    Distributed training using multiple devices (i.e., GPU servers) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and exercise effective performance optimizations when unexpected low training speed occurs. To date, there exists no software tool which diagnoses performance issues and helps expedite distributed DNN training, while the training can be run using different machine learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server architecture). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with <5% error in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.  ( 2 min )
    A Temporal-Pattern Backdoor Attack to Deep Reinforcement Learning. (arXiv:2205.02589v1 [cs.LG])
    Deep reinforcement learning (DRL) has made significant achievements in many real-world applications. But these real-world applications typically can only provide partial observations for making decisions due to occlusions and noisy sensors. However, partial state observability can be used to hide malicious behaviors for backdoors. In this paper, we explore the sequential nature of DRL and propose a novel temporal-pattern backdoor attack on DRL, whose trigger is a set of temporal constraints on a sequence of observations rather than a single observation, and whose effect can persist for a controllable duration rather than being instantaneous. We validate our proposed backdoor attack on a typical job scheduling task in cloud computing. Extensive experimental results show that our backdoor achieves excellent effectiveness, stealthiness, and sustainability. Our backdoor's average clean data accuracy and attack success rate reach 97.8% and 97.5%, respectively.  ( 2 min )
    Assistive Recipe Editing through Critiquing. (arXiv:2205.02454v1 [cs.CL])
    There has recently been growing interest in the automatic generation of cooking recipes that satisfy some form of dietary restrictions, thanks in part to the availability of online recipe data. Prior studies have used pre-trained language models, or relied on small paired recipe data (e.g., a recipe paired with a similar one that satisfies a dietary constraint). However, pre-trained language models generate inconsistent or incoherent recipes, and paired datasets are not available at scale. We address these deficiencies with RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work's main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users' feedback. Experiments on the Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.  ( 2 min )
    Spot-adaptive Knowledge Distillation. (arXiv:2205.02399v1 [cs.CV])
    Knowledge distillation (KD) has become a well established paradigm for compressing deep neural networks. The typical way of conducting knowledge distillation is to train the student network under the supervision of the teacher network to harness the knowledge at one or multiple spots (i.e., layers) in the teacher network. The distillation spots, once specified, will not change for all the training samples, throughout the whole distillation process. In this work, we argue that distillation spots should be adaptive to training samples and distillation epochs. We thus propose a new distillation strategy, termed spot-adaptive KD (SAKD), to adaptively determine the distillation spots in the teacher network per sample, at every training iteration during the whole distillation period. As SAKD actually focuses on "where to distill" instead of "what to distill" that is widely investigated by most existing works, it can be seamlessly integrated into existing distillation methods to further improve their performance. Extensive experiments with 10 state-of-the-art distillers are conducted to demonstrate the effectiveness of SAKD for improving their distillation performance, under both homogeneous and heterogeneous distillation settings. Code is available at https://github.com/zju-vipa/spot-adaptive-pytorch  ( 2 min )
    Soft and Hard Constrained Parametric Generative Schemes for Encoding and Synthesizing Airfoils. (arXiv:2205.02458v1 [physics.flu-dyn])
    Traditional airfoil parametrization techniques have significant limitations in modern aerodynamic optimization design. There is a strong demand for a parametric method with good intuitiveness, flexibility and representative accuracy. In this paper, two parametric generative schemes based on deep learning methods are proposed to represent the complicated design space under specific constraints. 1. Soft-constrained scheme: the CVAE-based model trains geometric constraints as part of the network and can provide constrained airfoil synthesis; 2. Hard-constrained scheme: the VAE-based model serves to generate diverse airfoils, while an FFD-based technique projects the generated airfoils onto final airfoils satisfying the given constraints. The statistical results show that the reconstructed airfoils are accurate and smooth without extra filters. The soft-constrained scheme tends to synthesize and explore airfoils efficiently and effectively, concentrating around the reference airfoil in both geometry space and objective space; the constraints are loosened slightly because of the inherent properties of the model. The hard-constrained scheme tends to generate and explore airfoils over a wider range in both geometry space and objective space, and the distribution in objective space is closer to a normal distribution. Airfoils synthesized through this scheme strictly conform to the constraints, though the projection may produce some odd airfoil shapes.  ( 2 min )
    Response Component Analysis for Sea State Estimation Using Artificial Neural Networks and Vessel Response Spectral Data. (arXiv:2205.02375v1 [cs.LG])
    The use of the 'ship as a wave buoy analogy' (SAWB) provides a novel means to estimate sea states, where relationships are established between causal wave properties and vessel motion response information. This study focuses on a model-free machine learning approach to SAWB-based sea state estimation (SSE), using neural networks (NNs) to map vessel response spectral data to statistical wave properties. Results showed a strong correlation between heave responses and significant wave height estimates, whilst the accuracy of mean wave period and wave heading predictions was observed to improve considerably when data from multiple vessel degrees of freedom (DOFs) was utilized. Overall, 3-DOF (heave, pitch and roll) NNs for SSE were shown to perform well when compared to existing SSE approaches that use similar simulation setups. Given the information-dense statistical representation of vessel motion responses in spectral form, as well as the ability of NNs to effectively model complex relationships between variables, the designed SSE method shows promise for future adaptation to mobile SSE systems using the SAWB approach.  ( 2 min )
    Alignahead: Online Cross-Layer Knowledge Extraction on Graph Neural Networks. (arXiv:2205.02468v1 [cs.LG])
    Existing knowledge distillation methods on graph neural networks (GNNs) are almost all offline, where the student model extracts knowledge from a powerful teacher model to improve its performance. However, a pre-trained teacher model is not always accessible due to training cost, privacy, etc. In this paper, we propose a novel online knowledge distillation framework to resolve this problem. Specifically, each student GNN model learns the extracted local structure from another simultaneously trained counterpart in an alternating training procedure. We further develop a cross-layer distillation strategy by aligning ahead one student layer with the layer at a different depth of another student model, which theoretically makes the structure information spread over all layers. Experimental results on five datasets, including PPI, Coauthor-CS/Physics and Amazon-Computer/Photo, demonstrate that the student performance is consistently boosted in our collaborative training framework without the supervision of a pre-trained teacher model. In addition, we find that our alignahead technique can accelerate model convergence, and that its effectiveness generally improves as the number of students in training increases. Code is available: https://github.com/GuoJY-eatsTG/Alignahead  ( 2 min )
    DeepExtrema: A Deep Learning Approach for Forecasting Block Maxima in Time Series Data. (arXiv:2205.02441v1 [cs.LG])
    Accurate forecasting of extreme values in time series is critical due to the significant impact of extreme events on human and natural systems. This paper presents DeepExtrema, a novel framework that combines a deep neural network (DNN) with generalized extreme value (GEV) distribution to forecast the block maximum value of a time series. Implementing such a network is a challenge as the framework must preserve the inter-dependent constraints among the GEV model parameters even when the DNN is initialized. We describe our approach to address this challenge and present an architecture that enables both conditional mean and quantile prediction of the block maxima. The extensive experiments performed on both real-world and synthetic data demonstrated the superiority of DeepExtrema compared to other baseline methods.  ( 2 min )
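    The inter-dependent constraints referred to above come from the GEV support condition 1 + ξ(y − μ)/σ > 0, which couples all three parameters; a hedged numpy sketch of the resulting negative log-likelihood (for ξ ≠ 0) follows, with toy block maxima standing in for real data.

        import numpy as np

        def gev_nll(y, mu, sigma, xi, eps=1e-9):
            """Negative log-likelihood of block maxima y under GEV(mu, sigma, xi), xi != 0."""
            t = 1.0 + xi * (y - mu) / sigma
            if sigma <= 0 or np.any(t <= 0):          # outside the GEV support
                return np.inf
            return np.sum(np.log(sigma) + (1.0 + 1.0 / xi) * np.log(t + eps)
                          + t ** (-1.0 / xi))

        y = np.array([2.1, 3.4, 2.8, 5.0, 3.1])      # toy block maxima
        print(gev_nll(y, mu=3.0, sigma=1.0, xi=0.1))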
    A Deep Learning Approach to Dst Index Prediction. (arXiv:2205.02447v1 [cs.LG])
    The disturbance storm time (Dst) index is an important and useful measurement in space weather research. It has been used to characterize the size and intensity of a geomagnetic storm. A negative Dst value means that the Earth's magnetic field is weakened, which happens during storms. In this paper, we present a novel deep learning method, called the Dst Transformer, to perform short-term, 1-6 hour ahead, forecasting of the Dst index based on the solar wind parameters provided by the NASA Space Science Data Coordinated Archive. The Dst Transformer combines a multi-head attention layer with Bayesian inference, which is capable of quantifying both aleatoric uncertainty and epistemic uncertainty when making Dst predictions. Experimental results show that the proposed Dst Transformer outperforms related machine learning methods in terms of root mean square error and R-squared. Furthermore, the Dst Transformer can produce both data and model uncertainty quantification results, which cannot be done by the existing methods. To our knowledge, this is the first time that Bayesian deep learning has been used for Dst index forecasting.  ( 2 min )
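    One common route to the two uncertainty types, though not necessarily the paper's exact Bayesian machinery, is a mean/log-variance output head for aleatoric uncertainty combined with Monte Carlo dropout for epistemic uncertainty; a PyTorch sketch on placeholder solar-wind features follows.

        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(0.1),
                            nn.Linear(64, 2))        # outputs [mean, log_variance]

        x = torch.randn(32, 8)                       # batch of solar-wind features
        net.train()                                  # keep dropout stochastic at test time
        with torch.no_grad():
            samples = torch.stack([net(x) for _ in range(50)])  # 50 MC passes

        mean = samples[..., 0]
        aleatoric = samples[..., 1].exp().mean(dim=0)        # avg predicted variance
        epistemic = mean.var(dim=0)                          # spread across passes
        print(mean.mean(dim=0).shape, aleatoric.shape, epistemic.shape)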
    Uncertainty-Based Non-Parametric Active Peak Detection. (arXiv:2205.02376v1 [cs.IT])
    Active, non-parametric peak detection is considered. As a use case, active source localization is examined, and an uncertainty-based sampling algorithm is designed to effectively localize the peak from a few energy measurements. It is shown that under very mild conditions, the source localization error with $m$ actively chosen energy measurements scales as $O(\log^2 m/m)$. Numerically, it is shown that in the low-sample regime the proposed method enjoys superior performance on several types of data, outperforming state-of-the-art passive source localization approaches and, in this regime, greedy methods as well.  ( 2 min )
    GitRank: A Framework to Rank GitHub Repositories. (arXiv:2205.02360v1 [cs.SE])
    Open-source repositories provide a wealth of information and are increasingly being used to build artificial intelligence (AI) based systems that solve problems in software engineering. Open-source repositories can vary in quality, and bad-quality repositories can degrade the performance of these systems. Evaluating the quality of open-source repositories, which is not directly available on code hosting sites such as GitHub, is thus important. In this hackathon, we utilize known code quality measures and the GrimoireLab toolkit to implement a framework, named GitRank, that ranks open-source repositories on three different criteria. We discuss our findings and preliminary evaluation in this hackathon report.  ( 2 min )
    KGTuner: Efficient Hyper-parameter Search for Knowledge Graph Learning. (arXiv:2205.02460v1 [cs.LG])
    While hyper-parameters (HPs) are important for knowledge graph (KG) learning, existing methods fail to search them efficiently. To solve this problem, we first analyze the properties of different HPs and measure their transfer ability from a small subgraph to the full graph. Based on the analysis, we propose an efficient two-stage search algorithm, KGTuner, which explores HP configurations on a small subgraph in the first stage and transfers the top-performing configurations for fine-tuning on the large full graph in the second stage. Experiments show that our method consistently finds better HPs than the baseline algorithms within the same time budget, achieving a 9.1% average relative improvement for four embedding models on the large-scale KGs in the open graph benchmark.  ( 2 min )
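    The two-stage idea fits in a few lines. In the toy sketch below, sample_config and evaluate are placeholders and the budgets are illustrative; KGTuner's analysis of which HPs transfer well is omitted.

        def two_stage_search(subgraph, full_graph, evaluate, sample_config,
                             n_stage1=200, top_k=10):
            # Stage 1: score many cheap trials on the small subgraph.
            stage1 = [(evaluate(cfg, subgraph), cfg)
                      for cfg in (sample_config() for _ in range(n_stage1))]
            stage1.sort(key=lambda t: t[0], reverse=True)  # higher score is better
            # Stage 2: re-evaluate only the top configurations on the full graph.
            return max(((evaluate(cfg, full_graph), cfg)
                        for _, cfg in stage1[:top_k]), key=lambda t: t[0])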
    Multi-Graph based Multi-Scenario Recommendation in Large-scale Online Video Services. (arXiv:2205.02446v1 [cs.AI])
    Recently, industrial recommendation services have been boosted by the continual upgrade of deep learning methods. However, they still face de-biasing challenges such as exposure bias and the cold-start problem, where repeated cycles of training on human interaction history lead algorithms to suggest already-exposed items over and over while ignoring less-active ones. Additional problems exist on multi-scenario platforms, e.g. appropriate data fusion from subsidiary scenarios, which we observe can be alleviated through graph-structured data integration via message passing. In this paper, we present a multi-graph structured multi-scenario recommendation solution, which encapsulates interaction data across scenarios with a multi-graph and obtains representations via graph learning. Extensive offline and online experiments on real-world datasets show that the proposed method increases CTR and Video Views per capita on new users by 0.63% and 0.71% over a deployed set of baselines, and outperforms the regular method by increasing the number of outer-scenario videos by 25% and video watches by 116%, validating its superiority in activating cold videos and enriching target recommendations.  ( 2 min )
    Convolutional and Residual Networks Provably Contain Lottery Tickets. (arXiv:2205.02343v1 [cs.LG])
    The Lottery Ticket Hypothesis continues to have a profound practical impact on the quest for small-scale deep neural networks that solve modern deep learning tasks at competitive performance. These lottery tickets are identified by pruning large randomly initialized neural networks with architectures as diverse as their applications. Yet, theoretical insights attesting to their existence have mostly focused on deep fully-connected feed-forward networks with ReLU activation functions. We prove that modern architectures consisting of convolutional and residual layers, equipped with almost arbitrary activation functions, can also contain lottery tickets with high probability.  ( 2 min )
    KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language. (arXiv:2205.02364v1 [cs.CL])
    This research developed KenSwQuAD, a Kencorpus Swahili Question Answering Dataset, from raw data of the Swahili language, a low resource language predominantly spoken in Eastern Africa that also has speakers in other parts of the world. Question answering datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. However, before such machine learning systems can perform these tasks, they need training data such as the gold standard Question Answering (QA) set developed in this research. The research engaged annotators to formulate question-answer pairs from Swahili texts collected by the Kencorpus project, a corpus covering three Kenyan languages. The total Swahili data collection had 2,585 texts, of which we annotated 1,445 story texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts was re-evaluated by different annotators, who confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the machine learning question answering task confirmed that the dataset can be used for such practical tasks. The research therefore developed KenSwQuAD, a question-answer dataset for Swahili that is useful to the natural language processing community, which needs training and gold standard sets for its machine learning applications. The research also contributed to the resourcing of the Swahili language, which is important for communication around the globe. Updating this set and providing similar sets for other low resource languages is an important area worthy of further research.  ( 3 min )
    Machine Learning Operations (MLOps): Overview, Definition, and Architecture. (arXiv:2205.02302v1 [cs.LG])
    The final goal of all industrial machine learning (ML) projects is to develop ML products and rapidly bring them into production. However, it is highly challenging to automate and operationalize ML products and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices, sets of concepts, and development culture. However, MLOps is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we conduct mixed-method research, including a literature review, a tool review, and expert interviews. As a result of these investigations, we provide an aggregated overview of the necessary principles, components, and roles, as well as the associated architecture and workflows. Furthermore, we furnish a definition of MLOps and highlight open challenges in the field. Finally, this work provides guidance for ML researchers and practitioners who want to automate and operate their ML products with a designated set of technologies.  ( 2 min )
    Knowledge Distillation of Russian Language Models with Reduction of Vocabulary. (arXiv:2205.02340v1 [cs.CL])
    Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field are mainly focused on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative option is to reduce the number of tokens in the vocabulary, and therefore the embeddings matrix, of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models. As a result, it is impossible to directly apply KL-based knowledge distillation. We propose two simple yet effective alignment techniques that make knowledge distillation possible for students with a reduced vocabulary. Evaluation of the distilled models on a number of common benchmarks for Russian, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhaser and Collection-3, demonstrates that our techniques achieve compression from $17\times$ to $49\times$ while maintaining the quality of a $1.7\times$-compressed student with the full-sized vocabulary but a reduced number of Transformer layers. We make our code and distilled models available.  ( 2 min )
    Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance. (arXiv:2205.02293v1 [cs.CL])
    Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and that its conclusions are mostly correlational rather than causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors: the train-test direction match (whether the human translation directions in the training and test sets are aligned) and the data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/EdisonNi-hku/CausalMT  ( 2 min )
    Equity and Fairness of Bayesian Knowledge Tracing. (arXiv:2205.02333v1 [cs.LG])
    We consider the equity and fairness of curricula derived from Knowledge Tracing models. We begin by defining a unifying notion of an equitable tutoring system as a system that achieves maximum possible knowledge in minimal time for each student interacting with it. Realizing perfect equity requires tutoring systems that can provide individualized curricula per student. In particular, we investigate the design of equitable tutoring systems that derive their curricula from Knowledge Tracing models. We first show that many existing models, including classical Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT), and their derived curricula can fall short of achieving equitable tutoring. To overcome this issue, we then propose a novel model, Bayesian-Bayesian Knowledge Tracing (BBKT), that naturally enables online individualization and, thereby, more equitable tutoring. We demonstrate that curricula derived from our model are more effective and equitable than those derived from classical BKT models. Furthermore, we highlight that improving models with a focus on the fairness of next-step predictions might be insufficient to develop equitable tutoring systems.  ( 2 min )
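    For reference, the classical BKT update that the paper contrasts against is a two-step Bayesian filter: condition the mastery probability on the observed response, then apply the learning transition. A minimal sketch with the standard parameter names:

        def bkt_update(p_mastery, correct, p_guess, p_slip, p_transit):
            # Posterior probability of mastery given one observed response.
            if correct:
                posterior = p_mastery * (1 - p_slip) / (
                    p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess)
            else:
                posterior = p_mastery * p_slip / (
                    p_mastery * p_slip + (1 - p_mastery) * (1 - p_guess))
            # Learning transition: an unmastered skill may become mastered.
            return posterior + (1 - posterior) * p_transit

        # e.g. one correct answer starting from a prior of 0.3:
        # bkt_update(0.3, True, p_guess=0.2, p_slip=0.1, p_transit=0.15)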
    Most Activation Functions Can Win the Lottery Without Excessive Depth. (arXiv:2205.02321v1 [cs.LG])
    The strong lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning, which has inspired interesting practical and theoretical insights into how neural networks can represent functions. For networks with ReLU activation functions, it has been proven that a target network with depth $L$ can be approximated by the subnetwork of a randomly initialized neural network that has double the target's depth $2L$ and is wider by a logarithmic factor. We show that a depth $L+1$ network is sufficient. This result indicates that we can expect to find lottery tickets at realistic, commonly used depths while only requiring logarithmic overparametrization. Our novel construction approach applies to a large class of activation functions and is not limited to ReLUs.  ( 2 min )
    An Adaptive Incremental Gradient Method With Support for Non-Euclidean Norms. (arXiv:2205.02273v1 [math.OC])
    Stochastic variance reduced methods have shown strong performance in solving finite-sum problems. However, these methods usually require the user to manually tune the step-size, which is time-consuming or even infeasible for some large-scale optimization tasks. To overcome this problem, we propose and analyze several novel adaptive variants of the popular SAGA algorithm. In particular, we design a variant of the Barzilai-Borwein step-size tailored to the incremental gradient method to ensure memory efficiency and fast convergence. We establish its convergence guarantees under general settings that allow non-Euclidean norms in the definition of smoothness and composite objectives, which cover a broad range of applications in machine learning. We improve the analysis of SAGA to support non-Euclidean norms, which fills a void in existing work. Numerical experiments on standard datasets demonstrate the competitive performance of the proposed algorithm compared with existing variance-reduced methods and their adaptive variants.  ( 2 min )
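    For orientation, the classic Barzilai-Borwein (BB1) step-size that such schemes build on is computed from successive iterates and gradients, as sketched below; the paper's memory-efficient incremental variant differs in detail, and in practice the step is safeguarded to stay positive and bounded.

        import numpy as np

        def bb_step_size(x_prev, x_curr, g_prev, g_curr, eps=1e-12):
            s = x_curr - x_prev   # iterate difference
            y = g_curr - g_prev   # gradient difference
            # BB1 step: <s, s> / <s, y>, with the denominator safeguarded.
            return float(np.dot(s, s) / max(np.dot(s, y), eps))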
    Language Models in the Loop: Incorporating Prompting into Weak Supervision. (arXiv:2205.02318v1 [cs.LG])
    We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited. Rather than applying the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework. To create a classifier, we first prompt the model to answer multiple distinct queries about an example and define how the possible responses should be mapped to votes for labels and abstentions. We then denoise these noisy label sources using the Snorkel system and train an end classifier with the resulting training data. Our experimental evaluation shows that prompting large language models within a weak supervision framework can provide significant gains in accuracy. On the WRENCH weak supervision benchmark, this approach significantly improves over zero-shot performance, with an average 19.5% reduction in errors. We also find that this approach produces classifiers with comparable or superior accuracy to those trained from hand-engineered rules.  ( 2 min )
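    A prompted labeling function in this style might look like the sketch below, where ask_model is a placeholder for an LLM call; a simple majority vote stands in for the Snorkel label model that the paper actually uses for denoising.

        ABSTAIN = -1

        def spam_lf(text, ask_model):
            # Map free-form model responses to a vote (1/0) or an abstention.
            answer = ask_model("Is the following message spam? "
                               "Answer yes or no.\n" + text).strip().lower()
            if answer.startswith("yes"):
                return 1
            if answer.startswith("no"):
                return 0
            return ABSTAIN  # unmappable response -> abstain

        def majority_vote(votes):
            votes = [v for v in votes if v != ABSTAIN]
            return max(set(votes), key=votes.count) if votes else ABSTAIN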
    Multivariate Prediction Intervals for Random Forests. (arXiv:2205.02260v1 [stat.ML])
    Accurate uncertainty estimates can significantly improve the performance of iterative design of experiments, as in Sequential and Reinforcement learning. For many such problems in engineering and the physical sciences, the design task depends on multiple correlated model outputs as objectives and/or constraints. To better solve these problems, we propose a recalibrated bootstrap method to generate multivariate prediction intervals for bagged models and show that it is well-calibrated. We apply the recalibrated bootstrap to a simulated sequential learning problem with multiple objectives and show that it leads to a marked decrease in the number of iterations required to find a satisfactory candidate. This indicates that the recalibrated bootstrap could be a valuable tool for practitioners using machine learning to optimize systems with multiple competing targets.  ( 2 min )
    Fine-Grained Address Segmentation for Attention-Based Variable-Degree Prefetching. (arXiv:2205.02269v1 [cs.AR])
    Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, treating prefetching as a classification problem for sequence prediction. However, the vast and sparse memory address space leads to a large vocabulary, which makes this modeling impractical. The number and order of outputs for multiple cache line prefetching are also fundamentally different from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between input and output. Prediction experiments demonstrate that address segmentation achieves 26% - 36% higher F1-score than delta inputs and 15% - 24% higher F1-score than page & offset inputs for the SPEC 2006, SPEC 2017, and GAP benchmarks. Simulation results show that TransFetch achieves 38.75% IPC improvement compared with no prefetching, outperforming the best-performing rule-based prefetcher BOP by 10.44% and the ML-based prefetcher Voyager by 6.64%.  ( 2 min )
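    The delta-bitmap output encoding takes only a few lines: future cache-line accesses become an unordered set of deltas from the current block, one bit per candidate delta, so the model never needs an address-sized vocabulary. The window size below is an illustrative choice, not the paper's configuration.

        def delta_bitmap(current_block, future_blocks, max_delta=64):
            # Slots cover deltas -max_delta..-1 (lower half) and
            # 1..max_delta (upper half); delta 0 is the current block itself.
            bitmap = [0] * (2 * max_delta)
            for block in future_blocks:
                delta = block - current_block
                if 0 < delta <= max_delta:
                    bitmap[max_delta + delta - 1] = 1
                elif -max_delta <= delta < 0:
                    bitmap[max_delta + delta] = 1
            return bitmap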
    Group-Invariant Quantum Machine Learning. (arXiv:2205.02261v1 [quant-ph])
    Quantum Machine Learning (QML) models are aimed at learning from data encoded in quantum states. Recently, it has been shown that models with little to no inductive biases (i.e., with no assumptions about the problem embedded in the model) are likely to have trainability and generalization issues, especially for large problem sizes. As such, it is fundamental to develop schemes that encode as much information as is available about the problem at hand. In this work we present a simple, yet powerful, framework where the underlying invariances in the data are used to build QML models that, by construction, respect those symmetries. These so-called group-invariant models produce outputs that remain invariant under the action of any element of the symmetry group $\mathfrak{G}$ associated with the dataset. We present theoretical results underpinning the design of $\mathfrak{G}$-invariant models, and exemplify their application through several paradigmatic QML classification tasks, including cases when $\mathfrak{G}$ is a continuous Lie group and when it is a discrete symmetry group. Notably, our framework allows us to recover, in an elegant way, several well-known algorithms from the literature, as well as to discover new ones. Taken together, we expect that our results will help pave the way towards a more geometric and group-theoretic approach to QML model design.  ( 2 min )
    Learning Individual Interactions from Population Dynamics with Discrete-Event Simulation Model. (arXiv:2205.02332v1 [cs.LG])
    The abundance of data allows researchers to pursue more powerful computational tools to learn the dynamics of complex systems, such as neural networks, engineered systems and social networks. Traditional machine learning approaches capture complex system dynamics either with dynamic Bayesian networks and state space models, which are hard to scale because it is non-trivial to prescribe the dynamics with a sparse graph or a system of differential equations, or with deep neural networks, where the distributed representation of the learned dynamics is hard to interpret. In this paper, we explore the possibility of learning a discrete-event simulation representation of complex system dynamics, assuming a multivariate normal distribution of the state variables, based on the observation that many complex system dynamics can be decomposed into a sequence of local interactions, each of which changes the system state only minimally but which in sequence generate complex and diverse dynamics. Our results show that the algorithm can data-efficiently capture complex network dynamics in several fields with meaningful events.  ( 2 min )
    Minimum Cost Intervention Design for Causal Effect Identification. (arXiv:2205.02232v1 [cs.LG])
    Pearl's do calculus is a complete axiomatic approach to learn the identifiable causal effects from observational data. When such an effect is not identifiable, it is necessary to perform a collection of often costly interventions in the system to learn the causal effect. In this work, we consider the problem of designing the collection of interventions with the minimum cost to identify the desired effect. First, we prove that this problem is NP-hard, and subsequently propose an algorithm that can either find the optimal solution or a logarithmic-factor approximation of it. This is done by establishing a connection between our problem and the minimum hitting set problem. Additionally, we propose several polynomial-time heuristic algorithms to tackle the computational complexity of the problem. Although these algorithms could potentially stumble on sub-optimal solutions, our simulations show that they achieve small regrets on random graphs.  ( 2 min )
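    The hitting-set connection suggests the textbook greedy heuristic as a baseline: repeatedly pick the intervention with the lowest cost per newly-hit set, which attains the usual logarithmic-factor approximation. A hedged sketch, not the paper's exact algorithm:

        def greedy_hitting_set(sets, costs):
            # sets: list of sets of candidate interventions; costs: intervention -> cost.
            # Assumes every set can be hit by some intervention in costs.
            remaining = [set(s) for s in sets]
            chosen = set()
            while remaining:
                def cost_effectiveness(e):
                    hits = sum(1 for s in remaining if e in s)
                    return costs[e] / hits if hits else float("inf")
                best = min(costs, key=cost_effectiveness)
                chosen.add(best)
                remaining = [s for s in remaining if best not in s]
            return chosen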
  • Open

    Fitting an immersed submanifold to data via Sussmann's orbit theorem. (arXiv:2204.01119v2 [cs.LG] UPDATED)
    This paper describes an approach for fitting an immersed submanifold of a finite-dimensional Euclidean space to random samples. The reconstruction mapping from the ambient space to the desired submanifold is implemented as a composition of an encoder that maps each point to a tuple of (positive or negative) times and a decoder given by a composition of flows along finitely many vector fields starting from a fixed initial point. The encoder supplies the times for the flows. The encoder-decoder map is obtained by empirical risk minimization, and a high-probability bound is given on the excess risk relative to the minimum expected reconstruction error over a given class of encoder-decoder maps. The proposed approach makes fundamental use of Sussmann's orbit theorem, which guarantees that the image of the reconstruction map is indeed contained in an immersed submanifold.  ( 2 min )
    Tracking the risk of a deployed model and detecting harmful distribution shifts. (arXiv:2110.06177v4 [stat.ML] UPDATED)
    When deployed in the real world, machine learning models inevitably encounter changes in the data distribution, and certain -- but not all -- distribution shifts could result in significant performance degradation. In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially, making interventions by a human expert (or model retraining) unnecessary. While several works have developed tests for distribution shifts, these typically either use non-sequential methods, or detect arbitrary shifts (benign or harmful), or both. We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate. In this work, we design simple sequential tools for testing if the difference between source (training) and target (test) distributions leads to a significant increase in a risk function of interest, like accuracy or calibration. Recent advances in constructing time-uniform confidence sequences allow efficient aggregation of statistical evidence accumulated during the tracking process. The designed framework is applicable in settings where (some) true labels are revealed after the prediction is performed, or when batches of labels become available in a delayed fashion. We demonstrate the efficacy of the proposed framework through an extensive empirical study on a collection of simulated and real datasets.  ( 2 min )
    Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models. (arXiv:2110.14993v2 [cs.LG] UPDATED)
    We study prediction of future outcomes with supervised models that use privileged information during learning. The privileged information comprises samples of time series observed between the baseline time of prediction and the future outcome; this information is only available at training time, which differs from the traditional supervised learning setting. Our question is when using this privileged data leads to more sample-efficient learning of models that use only baseline data for predictions at test time. We give an algorithm for this setting and prove that when the time series are drawn from a non-stationary Gaussian-linear dynamical system of fixed horizon, learning with privileged information is more efficient than learning without it. On synthetic data, we test the limits of our algorithm and theory, both when our assumptions hold and when they are violated. On three diverse real-world datasets, we show that our approach is generally preferable to classical learning, particularly when data is scarce. Finally, we relate our estimator to a distillation approach, both theoretically and empirically.  ( 2 min )
    Local Latin Hypercube Refinement for Multi-objective Design Uncertainty Optimization. (arXiv:2108.08890v2 [stat.ML] UPDATED)
    Optimizing the reliability and robustness of a design is important but often unaffordable due to high sample requirements. Surrogate models based on statistical and machine learning methods are used to increase sample efficiency. However, for higher dimensional or multi-modal systems, surrogate models may also require a large number of samples to achieve good results. We propose a sequential sampling strategy for the surrogate-based solution of multi-objective, reliability-based robust design optimization problems. The proposed local Latin hypercube refinement (LoLHR) strategy is model-agnostic and can be combined with any surrogate model, because there is no free lunch, but possibly a budget one. The proposed method is compared to stationary sampling as well as other strategies from the literature. Gaussian process and support vector regression are both used as surrogate models. Empirical evidence is presented showing that LoLHR achieves on average better results than other surrogate-based strategies on the tested examples.  ( 2 min )
    Dropout Strikes Back: Improved Uncertainty Estimation via Diversity Sampling. (arXiv:2003.03274v3 [cs.LG] UPDATED)
    Uncertainty estimation for machine learning models is of high importance in many scenarios, such as constructing confidence intervals for model predictions and detecting out-of-distribution or adversarially generated points. In this work, we show that modifying the sampling distributions of dropout layers in neural networks improves the quality of uncertainty estimation. Our approach consists of two main steps: computing data-driven correlations between neurons and generating samples that include maximally diverse neurons. In a series of experiments on simulated and real-world data, we demonstrate that this diversification via determinantal point process-based sampling achieves state-of-the-art results in uncertainty estimation for regression and classification tasks. An important feature of our approach is that it does not require any modification to the models or training procedures, allowing straightforward application to any deep learning model with dropout layers.  ( 2 min )
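    To give a flavor of the diversity step, the sketch below builds a similarity kernel from neuron activation correlations and greedily selects a diverse subset via the standard greedy log-determinant heuristic; the paper samples from a proper determinantal point process, for which this greedy selection is only a crude proxy.

        import numpy as np

        def diverse_neurons(activations, k):
            # activations: (n_examples, n_neurons); returns k diverse neuron indices.
            corr = np.corrcoef(activations, rowvar=False)
            kernel = corr ** 2 + 1e-6 * np.eye(corr.shape[0])  # PSD similarity kernel
            selected = []
            for _ in range(k):
                best, best_gain = None, -np.inf
                for j in range(kernel.shape[0]):
                    if j in selected:
                        continue
                    idx = selected + [j]
                    _, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
                    if logdet > best_gain:  # larger determinant = more diverse set
                        best, best_gain = j, logdet
                selected.append(best)
            return selected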
    Partition MCMC for inference on acyclic digraphs. (arXiv:1504.05006v2 [stat.ML] CROSS LISTED)
    Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. MCMC methods that sample graphs from the posterior distribution given the data, notably structure MCMC, are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain's convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and non-ergodic edge reversal moves. Here we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result, convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally, the method can be combined with edge reversal moves to improve the sampler further.  ( 2 min )
    Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks. (arXiv:2202.05258v2 [cs.LG] UPDATED)
    We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries.  ( 2 min )
    A Change Dynamic Model for the Online Detection of Gradual Change. (arXiv:2205.01054v3 [stat.ML] UPDATED)
    Changes in the statistical properties of a stochastic process are typically assumed to occur via change-points, which demarcate instantaneous moments of complete and total change in process behavior. In cases where these transitions occur gradually, this assumption can result in a reduced ability to properly identify and respond to process change. With this observation in mind, we introduce a novel change-dynamic model for the online detection of gradual change in a Bayesian framework, in which change-points are used within a hierarchical model to indicate moments of gradual change onset or termination. We apply this model to synthetic data and EEG readings recorded during epileptic seizure, where we find our change-dynamic model enables faster and more accurate identification of gradual change than traditional change-point models allow.  ( 2 min )
    Linear Discriminant Analysis with High-dimensional Mixed Variables. (arXiv:2112.07145v2 [stat.ME] UPDATED)
    Datasets containing both categorical and continuous variables are frequently encountered in many areas, and with the rapid development of modern measurement technologies, the dimensions of these variables can be very high. Despite the recent progress made in modelling high-dimensional data for continuous variables, there is a scarcity of methods that can deal with a mixed set of variables. To fill this gap, this paper develops a novel approach for classifying high-dimensional observations with mixed variables. Our framework builds on a location model, in which the distributions of the continuous variables conditional on categorical ones are assumed Gaussian. We overcome the challenge of having to split data into exponentially many cells, or combinations of the categorical variables, by kernel smoothing, and provide new perspectives for its bandwidth choice to ensure an analogue of Bochner's Lemma, which is different to the usual bias-variance tradeoff. We show that the two sets of parameters in our model can be separately estimated and provide penalized likelihood for their estimation. Results on the estimation accuracy and the misclassification rates are established, and the competitive performance of the proposed classifier is illustrated by extensive simulation and real data studies.  ( 2 min )
    Approximate exploitability: Learning a best response in large games. (arXiv:2004.09677v4 [cs.LG] UPDATED)
    Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.  ( 2 min )
    Addendum on the scoring of Gaussian directed acyclic graphical models. (arXiv:1402.6863v4 [stat.ML] CROSS LISTED)
    We provide a correction to the expression for scoring Gaussian directed acyclic graphical models derived in Geiger and Heckerman [Ann. Statist. 30 (2002) 1414-1440] and discuss how to evaluate the score efficiently.  ( 2 min )
    Resilience of Bayesian Layer-Wise Explanations under Adversarial Attacks. (arXiv:2102.11010v3 [cs.LG] UPDATED)
    We consider the problem of the stability of saliency-based explanations of Neural Network predictions under adversarial attacks in a classification task. Saliency interpretations of deterministic Neural Networks are remarkably brittle even when the attacks fail, i.e. for attacks that do not change the classification label. We empirically show that interpretations provided by Bayesian Neural Networks are considerably more stable under adversarial perturbations of the inputs and even under direct attacks to the explanations. By leveraging recent results, we also provide a theoretical explanation of this result in terms of the geometry of the data manifold. Additionally, we discuss the stability of the interpretations of high level representations of the inputs in the internal layers of a Network. Our results demonstrate that Bayesian methods, in addition to being more robust to adversarial attacks, have the potential to provide more stable and interpretable assessments of Neural Network predictions.  ( 2 min )
    Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization. (arXiv:2201.07912v2 [cs.LG] UPDATED)
    Federated learning (FL) is a useful tool in distributed machine learning that utilizes users' local datasets in a privacy-preserving manner. When deploying FL in a constrained wireless environment, however, training models in a time-efficient manner can be a challenging task due to intermittent connectivity of devices, heterogeneous connection quality, and non-i.i.d. data. In this paper, we provide a novel convergence analysis of non-convex loss functions using FL on both i.i.d. and non-i.i.d. datasets with arbitrary device selection probabilities for each round. Then, using the derived convergence bound, we use stochastic optimization to develop a new client selection and power allocation algorithm that minimizes a function of the convergence bound and the average communication time under a transmit power constraint. We find an analytical solution to the minimization problem. One key feature of the algorithm is that knowledge of the channel statistics is not required; only the instantaneous channel state information needs to be known. Using the FEMNIST and CIFAR-10 datasets, we show through simulations that the communication time can be significantly decreased using our algorithm compared to uniformly random participation.  ( 2 min )
    Riemannian classification of EEG signals with missing values. (arXiv:2110.10011v2 [cs.HC] UPDATED)
    This paper proposes a strategy to handle missing data for the classification of electroencephalograms using covariance matrices. It relies on the observed-data likelihood within an expectation-maximization algorithm. This approach is compared to two existing state-of-the-art methods: (i) covariance matrices computed from imputed data; and (ii) Riemannian averages of partially observed covariance matrices. All approaches are combined with the minimum-distance-to-Riemannian-mean classifier and applied to a classification task on two widely known brain-computer interface paradigms. In addition to being applicable to a wider range of missing data scenarios, the proposed strategy generally performs better than the other methods on the considered real EEG data.  ( 2 min )
    Non-Euclidean Differentially Private Stochastic Convex Optimization: Optimal Rates in Linear Time. (arXiv:2103.01278v2 [cs.LG] UPDATED)
    Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of $n$ i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) w.r.t. the $\ell_2$ norm over a constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting. In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups under a standard smoothness assumption on the loss. For $1< p\leq 2$, under a standard smoothness assumption, we give a new, linear-time DP-SCO algorithm with optimal excess risk. Previously known constructions with optimal excess risk for $1< p <2$ run in super-linear time in $n$. For $p=1$, we give an algorithm with nearly optimal excess risk. Our result for the $\ell_1$-setup also extends to general polyhedral norms and feasible sets. Moreover, we show that the excess risk bounds resulting from our algorithms for $1\leq p \leq 2$ are attained with high probability. For $2 < p \leq \infty$, we show that existing linear-time constructions for the Euclidean setup attain a nearly optimal excess risk in the low-dimensional regime. As a consequence, we show that such constructions attain a nearly optimal excess risk for $p=\infty$. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.  ( 3 min )
    Is Pessimism Provably Efficient for Offline RL?. (arXiv:2012.15085v3 [cs.LG] UPDATED)
    We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.  ( 2 min )
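    In tabular form, the pessimistic backup is just the usual Bellman update with an uncertainty penalty subtracted, as in the caricature below; the count-based penalty and the constant beta are illustrative stand-ins for the paper's uncertainty quantifier under function approximation.

        import numpy as np

        def pessimistic_vi(R, P, counts, gamma=0.99, beta=1.0, iters=500):
            # R: (S, A) rewards; P: (S, A, S) transitions; counts: (S, A) visits.
            penalty = beta / np.sqrt(np.maximum(counts, 1))  # larger where data is scarce
            V = np.zeros(R.shape[0])
            for _ in range(iters):
                Q = R - penalty + gamma * P @ V              # penalized Bellman backup
                V = Q.max(axis=1)
            return Q.argmax(axis=1)                          # greedy pessimistic policy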
    Automated Imbalanced Classification via Layered Learning. (arXiv:2205.02553v1 [cs.LG])
    In this paper we address imbalanced binary classification (IBC) tasks. Applying resampling strategies to balance the class distribution of training instances is a common approach to tackle these problems. Many state-of-the-art methods find instances of interest close to the decision boundary to drive the resampling process. However, under-sampling the majority class may potentially lead to important information loss. Over-sampling may also increase the chance of overfitting by propagating the information contained in instances from the minority class. The main contribution of our work is a new method called ICLL for tackling IBC tasks which is not based on resampling training observations. Instead, ICLL follows a layered learning paradigm to model the data in two stages. In the first layer, ICLL learns to distinguish cases close to the decision boundary from cases which are clearly from the majority class, where this dichotomy is defined using a hierarchical clustering analysis. In the subsequent layer, we use instances close to the decision boundary and instances from the minority class to solve the original predictive task. A second contribution of our work is the automatic definition of the layers which comprise the layered learning strategy using a hierarchical clustering model. This is a relevant discovery as this process is usually performed manually according to domain knowledge. We carried out extensive experiments using 100 benchmark data sets. The results show that the proposed method leads to better performance relative to several state-of-the-art methods for IBC.  ( 2 min )
    Holistic Approach to Measure Sample-level Adversarial Vulnerability and its Utility in Building Trustworthy Systems. (arXiv:2205.02604v1 [cs.CV])
    An adversarial attack perturbs an image with imperceptible noise, leading to incorrect model predictions. Recently, a few works showed an inherent bias associated with such attacks (robustness bias), where certain subgroups in a dataset (e.g. based on class, gender, etc.) are less robust than others. This bias not only persists even after adversarial training, but often results in severe performance discrepancies across these subgroups. Existing works characterize a subgroup's robustness bias by only checking an individual sample's proximity to the decision boundary. In this work, we argue that this measure alone is not sufficient and validate our argument via extensive experimental analysis. It has been observed that adversarial attacks often corrupt the high-frequency components of the input image. We, therefore, propose a holistic approach for quantifying the adversarial vulnerability of a sample by combining these different perspectives, i.e., the degree of the model's reliance on high-frequency features and the (conventional) sample-distance to the decision boundary. We demonstrate that by reliably estimating adversarial vulnerability at the sample level using the proposed holistic metric, it is possible to develop a trustworthy system where humans can be alerted about incoming samples that are highly likely to be misclassified at test time. This is achieved with better precision when our holistic metric is used over individual measures. To further corroborate the utility of the proposed holistic approach, we perform knowledge distillation in a limited-sample setting. We observe that the student network trained with the subset of samples selected using our combined metric performs better than both competing baselines, viz., where samples are selected randomly or based on their distances to the decision boundary.  ( 2 min )
    Quantum Extremal Learning. (arXiv:2205.02807v1 [quant-ph])
    We propose a quantum algorithm for 'extremal learning', the process of finding the input that extremizes the output of a hidden function, without direct access to that function, given only partial input-output (training) data. The algorithm, called quantum extremal learning (QEL), consists of a parametric quantum circuit that is variationally trained to model data input-output relationships, and in which a trainable quantum feature map encoding the input data is analytically differentiated in order to find the coordinate that extremizes the model. This enables the combination of established quantum machine learning modelling with established quantum optimization on a single circuit/quantum computer. We have tested our algorithm on a range of classical datasets based on either discrete or continuous input variables, both of which are compatible with the algorithm. In the case of discrete variables, we test our algorithm on synthetic problems formulated from Max-Cut problem generators, also considering higher-order correlations in the input-output relationships. In the case of continuous variables, we test our algorithm on synthetic datasets in 1D and on simple ordinary differential equations. We find that the algorithm is able to successfully find the extremal value of such problems, even when the training dataset is sparse or covers only a small fraction of the input configuration space. We additionally show how the algorithm can be used for much more general cases of higher dimensionality, complex differential equations, and with full flexibility in the choice of both modeling and optimization ansatz. We envision that, due to its general framework and simple construction, the QEL algorithm will be able to solve a wide variety of applications in different fields, opening up areas of further research.  ( 2 min )
    Communication-Efficient Adaptive Federated Learning. (arXiv:2205.02719v1 [cs.LG])
    Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to the repetitive server-client synchronization and the lack of adaptivity by SGD-based model updates. Despite that various methods have been proposed for reducing the communication cost by gradient compression or quantization, and the federated versions of adaptive optimizers such as FedAdam are proposed to add more adaptivity, the current federated learning framework still cannot solve the aforementioned challenges all at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.  ( 2 min )
    Bivariate vine copula based quantile regression. (arXiv:2205.02557v1 [stat.ME])
    The statistical analysis of univariate quantiles is a well developed research topic. However, there is a profound need for research in multivariate quantiles. We tackle the topic of bivariate quantiles and bivariate quantile regression using vine copulas. These are graph-theoretical models identified by a sequence of linked trees, which allow for separate modelling of the marginal distributions and the dependence structure. We introduce a novel graph structure model (given by a tree sequence) specifically designed for a symmetric treatment of two responses in a predictive regression setting. We establish computational tractability of the model and a straightforward way of obtaining different conditional distributions. Using vine copulas, the typical shortfalls of regression, such as the need for transformations or interactions of predictors, collinearity, or quantile crossings, are avoided. We illustrate the copula-based bivariate quantiles for different copula distributions and provide a data set example. Further, the example emphasizes the benefits of joint bivariate response modelling over two separate univariate regressions, or over assuming conditional independence, for a bivariate response data set in the presence of conditional dependence.  ( 2 min )
    The interventional Bayesian Gaussian equivalent score for Bayesian causal inference with unknown soft interventions. (arXiv:2205.02602v1 [stat.ME])
    Describing the causal relations governing a system is a fundamental task in many scientific fields, ideally addressed by experimental studies. However, obtaining data under intervention scenarios may not always be feasible, while discovering causal relations from purely observational data is notoriously challenging. In certain settings, such as genomics, we may have data from heterogeneous study conditions, with soft (partial) interventions only pertaining to a subset of the study variables, whose effects and targets are possibly unknown. Combining data from experimental and observational studies offers the opportunity to leverage both domains and improve the identifiability of causal structures. To this end, we define the interventional BGe score for a mixture of observational and interventional data, where the targets and effects of intervention may be unknown. To demonstrate the approach, we compare its performance to other state-of-the-art algorithms, both in simulations and in data analysis applications. A key advantage of our method is that it takes a Bayesian perspective, leading to a full characterisation of the posterior distribution over DAG structures. Given a sample of DAGs, one can also automatically derive full posterior distributions of the intervention effects. Consequently, the method effectively captures the uncertainty in both the structure and the parameter estimates. Code to reproduce the simulations and analyses is publicly available at github.com/jackkuipers/iBGe  ( 2 min )
    Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning. (arXiv:2205.02450v1 [cs.LG])
    Dynamic mechanism design has garnered significant attention from both computer scientists and economists in recent years. By allowing agents to interact with the seller over multiple rounds, where agents' reward functions may change with time and are state-dependent, the framework can model a rich class of real-world problems. In these works, the interaction between agents and sellers is often assumed to follow a Markov Decision Process (MDP). We focus on the setting where the reward and transition functions of such an MDP are not known a priori, and we attempt to recover the optimal mechanism using an a priori collected data set. In the setting where function approximation is employed to handle large state spaces, with only mild assumptions on the expressiveness of the function class, we are able to design a dynamic mechanism using offline reinforcement learning algorithms. Moreover, the learned mechanisms approximately satisfy three key desiderata: efficiency, individual rationality, and truthfulness. Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set. To the best of our knowledge, our work provides the first offline RL algorithm for dynamic mechanism design without assuming uniform coverage.  ( 2 min )
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v1 [stat.ML])
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.  ( 2 min )
    Polynomial-Time Algorithms for Counting and Sampling Markov Equivalent DAGs with Applications. (arXiv:2205.02654v1 [cs.LG])
    Counting and sampling directed acyclic graphs from a Markov equivalence class are fundamental tasks in graphical causal analysis. In this paper we show that these tasks can be performed in polynomial time, solving a long-standing open problem in this area. Our algorithms are effective and easily implementable. As we show in experiments, these breakthroughs make thought-to-be-infeasible strategies in active learning of causal structures and causal effect identification with regard to a Markov equivalence class practically applicable.  ( 2 min )
    REDS: Rule Extraction for Discovering Scenarios. (arXiv:1910.01713v2 [cs.LG] UPDATED)
    Scenario discovery is the process of finding areas of interest, known as scenarios, in data spaces resulting from simulations. For instance, one might search for conditions, i.e., inputs of the simulation model, where the system is unstable. Subgroup discovery methods are commonly used for scenario discovery. They find scenarios in the form of hyperboxes, which are easy to comprehend. Given a computational budget, results tend to get worse as the number of inputs of the simulation model and the cost of simulations increase. We propose a new procedure for scenario discovery from few simulations, dubbed REDS. A key ingredient is using an intermediate machine learning model to label data for subsequent use by conventional subgroup discovery methods. We provide statistical arguments why this is an improvement. In our experiments, REDS reduces the number of simulations required by 50-75% on average, depending on the quality measure. It is also useful as a semi-supervised subgroup discovery method and for discovering better scenarios from third-party data, when a simulation model is not available.  ( 2 min )
    Dynamic Bayesian Network Auxiliary ABC-SMC for Hybrid Model Bayesian Inference to Accelerate Biomanufacturing Process Mechanism Learning and Robust Control. (arXiv:2205.02410v1 [stat.ML])
    Driven by the critical needs of biomanufacturing 4.0, we present a probabilistic knowledge graph hybrid model characterizing complex spatial-temporal causal interdependencies of underlying bioprocessing mechanisms. It can faithfully capture the important properties, including nonlinear reactions, partially observed state, and nonstationary dynamics. Given limited process observations, we derive a posterior distribution quantifying model uncertainty, which can facilitate mechanism learning and support robust process control. To avoid evaluation of intractable likelihood, Approximate Bayesian Computation sampling with Sequential Monte Carlo (ABC-SMC) is developed to approximate the posterior distribution. Given high stochastic and model uncertainties, it is computationally expensive to match process output trajectories. Therefore, we propose a linear Gaussian dynamic Bayesian network (LG-DBN) auxiliary likelihood-based ABC-SMC algorithm. Through matching observed and simulated summary statistics, the proposed approach can dramatically reduce the computation cost and accelerate the posterior approximation convergence.  ( 2 min )
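    Stripped of the SMC machinery, the summary-statistic matching at the heart of ABC looks like the rejection sketch below; the prior, simulator and statistics are placeholders, and the paper's LG-DBN auxiliary likelihood and sequential tempering are layered on top of this basic idea.

        import numpy as np

        def abc_rejection(prior_sample, simulate, summary, observed, n=10000, eps=0.5):
            s_obs = summary(observed)
            accepted = []
            for _ in range(n):
                theta = prior_sample()
                s_sim = summary(simulate(theta))
                if np.linalg.norm(s_sim - s_obs) < eps:  # keep parameters whose
                    accepted.append(theta)               # summaries match the data
            return np.array(accepted)                    # approximate posterior sample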
    DeepBayes -- an estimator for parameter estimation in stochastic nonlinear dynamical models. (arXiv:2205.02264v1 [stat.ML])
    Stochastic nonlinear dynamical systems are ubiquitous in modern, real-world applications. Yet, estimating the unknown parameters of stochastic, nonlinear dynamical models remains a challenging problem. The majority of existing methods employ maximum likelihood or Bayesian estimation. However, these methods suffer from some limitations, most notably the substantial computational time for inference coupled with limited flexibility in application. In this work, we propose DeepBayes estimators that leverage the power of deep recurrent neural networks in learning an estimator. The method consists of first training a recurrent neural network to minimize the mean-squared estimation error over a set of synthetically generated data using models drawn from the model set of interest. The a priori trained estimator can then be used directly for inference by evaluating the network with the estimation data. The deep recurrent neural network architectures can be trained offline and ensure significant time savings during inference. We experiment with two popular recurrent neural networks -- long short term memory network (LSTM) and gated recurrent unit (GRU). We demonstrate the applicability of our proposed method on different example models and perform detailed comparisons with state-of-the-art approaches. We also provide a study on a real-world nonlinear benchmark problem. The experimental evaluations show that the proposed approach is asymptotically as good as the Bayes estimator.  ( 2 min )
    Multivariate Prediction Intervals for Random Forests. (arXiv:2205.02260v1 [stat.ML])
    Accurate uncertainty estimates can significantly improve the performance of iterative design of experiments, as in Sequential and Reinforcement learning. For many such problems in engineering and the physical sciences, the design task depends on multiple correlated model outputs as objectives and/or constraints. To better solve these problems, we propose a recalibrated bootstrap method to generate multivariate prediction intervals for bagged models and show that it is well-calibrated. We apply the recalibrated bootstrap to a simulated sequential learning problem with multiple objectives and show that it leads to a marked decrease in the number of iterations required to find a satisfactory candidate. This indicates that the recalibrated bootstrap could be a valuable tool for practitioners using machine learning to optimize systems with multiple competing targets.  ( 2 min )

  • Open

    Does anyone know of an app that can rewrite an entire document in one go to sound more professional?
    I know there's Grammarly, but is there something that does it all in one go to make it more cohesive? Thanks! submitted by /u/chriscarmy [link] [comments]  ( 1 min )
    IVY: An Open-Source Tool To Make Deep Learning Code Compatible Across Frameworks
    As ML aficionados, we’ve all come across interesting projects on GitHub only to discover that they are not in the framework we want and are familiar with. It can be tedious at times to reimplement the whole codebase in our framework, let alone deal with any errors that may arise throughout the process. It is a tedious chore that no one wants to do. Isn’t it good to have something that doesn’t care what framework you’re using? It will provide you with code in your desired framework, whether it is JAX, PyTorch, MXNet, Numpy, or TensorFlow. This is what IVY is attempting to do by unifying all ML frameworks. The number of open-source machine learning projects has surged significantly over the past few years. This is evidenced by the fast-growing number of GitHub repositories using the keyword deep learning. Because of different frameworks, code shareability has been considerably hampered. Aside from that, many frameworks become obsolete in comparison to newer frameworks. For software development where collaboration is vital, this is a significant bottleneck. As newer frameworks come onto the scene, framework-specific code quickly becomes obsolete, and transferring code across frameworks is akin to reinventing the wheel. Continue Reading GitHub: https://github.com/unifyai/ivy Paper: https://arxiv.org/pdf/2102.02886.pdf Project: https://lets-unify.ai/ivy/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Any apps to have conversations with AI talking heads?
    I'm a software engineer and I've become interested in making an app where you can have a conversation with an AI-generated talking head. I guess the goal would be to feel like having a video-call conversation with another human, except it's all AI generated. TBH, it wouldn't have to be human; even a cartoon character would be cool. The closest thing I can find suggests combining: Face Generation (StyleGAN), Text Generation (GPT), Text-To-Speech (FlowTron), and Lip-Sync Animation (LipGAN). https://medium.com/swlh/how-to-create-fake-talking-head-videos-with-deep-learning-code-tutorial-f9cfc0c19ab5 Does anybody know: Of any apps that are already doing something similar? How to achieve this in an app? Most tutorials seem to be Python, which won't run on the frontend, so I'm wondering if there's anything Swift, Java, Kotlin, React Native etc. based. Thanks submitted by /u/djames843 [link] [comments]  ( 1 min )
    Why won't AI ever replace humans in the healthcare field? I guess we don't have the technology right now, but won't AI be a lot more accurate and safer at diagnosing diseases, etc.?
    submitted by /u/UmbraShield [link] [comments]  ( 1 min )
    OpenAI founder Sam Altman sees a big AI revolution within this decade
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    Paintings.
    submitted by /u/cookingandcraft [link] [comments]
    Best Natural Language Processing Courses to Know in 2022
    Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. Natural language processing (NLP) describes the interaction between human language and computers. It's a technology that many people use daily and has been around for years, but is often taken for granted. A few examples of NLP that people use every day are: spell check. Looking to gain NLP skills? Then here are the best Natural Language Processing courses to know in 2022. submitted by /u/Lakshmireddys [link] [comments]  ( 1 min )
    Meta's open-source new model OPT is GPT-3's closest competitor!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
  • Open

    [D] Any good podcast discussing ML papers?
    I was wondering if there are any good podcasts discussing recent papers in ML space? Something similar to Yannic's videos but in podcast format. The closest one I have found is "TWIML AI". submitted by /u/KeikakuAccelerator [link] [comments]  ( 1 min )
    [D] Did ICCV Dramatically Change its Template?
    I'm preparing to re-submit my paper to a conference, and I'm looking at ICCV 2023. The call for papers provides a template here, but it doesn't look a thing like the template from the most recently released proceedings, ICCV 2021. The page limit (15 pages!!) is also dramatically different. Am I looking at the same ICCV? If so, does anyone know what the impetus was for the change? I can't say the new one looks nicer, though I am happy about the more generous space limits (which only feels appropriate in an age where most of these papers will only ever be read digitally...) submitted by /u/relenzo [link] [comments]  ( 1 min )
    [D] How does SHAP handle removed features?
    I understood that SHAP computes predictions for each possible feature combination. But how can this be done by simply passing clf.predict_proba as a parameter? Does it retrain the model every time the number of features changes for a combination? submitted by /u/savoga [link] [comments]  ( 1 min )
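    For what it's worth, KernelSHAP (the mode used when a bare prediction function is passed) does not retrain anything: a "removed" feature is marginalized out by substituting values drawn from a background dataset and averaging predict_proba over those hybrid samples. A minimal sketch, assuming the shap package:

        import shap
        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier

        X, y = load_iris(return_X_y=True)
        clf = RandomForestClassifier(random_state=0).fit(X, y)

        # "removed" features take values drawn from this background set;
        # the model itself is never retrained
        background = shap.sample(X, 50)
        explainer = shap.KernelExplainer(clf.predict_proba, background)
        shap_values = explainer.shap_values(X[:1])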
    API call-sequence and user-behaviour datasets [R] [P]
    Distributed microservices-based applications are typically accessed via APIs. These APIs are used either by apps or they can be accessed directly via programmatic means. Many a time, API access is abused by attackers trying to exploit the business logic exposed by these APIs. The way normal users access these APIs is different from how attackers access them. Many applications have hundreds of APIs that are called in a specific order, and depending on various factors such as browser refreshes, session refreshes, network errors, or programmatic access, these behaviors are not static and can vary for the same user. API calls in long-running sessions form access graphs that need to be analysed in order to discover attack patterns and anomalies. Graphs don't lend themselves to numerical computation. We address this issue and provide a dataset where user access behavior is qualified as numerical features. In addition, we provide a dataset where raw API call graphs are provided. Supporting the use of these datasets, two notebooks on classification, node embeddings and clustering are also provided. Check out the dataset and some notebooks that show how it can be used at https://www.kaggle.com/datasets/tangodelta/api-access-behaviour-anomaly-dataset submitted by /u/indra_gunt [link] [comments]  ( 1 min )
    [R] Looking for resources on hybrid modeling using known physics and machine learning
    For my master thesis (mechanical engineering) I'm working on a hybrid model of a robotic manipulator. The goal is to extend the existing physics-based model with a machine learning component that aims to learn the unmodelled dynamics. I've read about PINNs and PGNNs, but they don't quite seem to capture the methodology I'm looking for. To give a clearer direction, I want to develop a model similar to this paper. Thanks in advance. submitted by /u/MalleMinkukel [link] [comments]  ( 1 min )
    [R] Early Stoppable/Iterable Pre-trained Object Recognition Model
    Hey, I am currently working on my thesis in the field of microservices in mission-critical applications, and I am searching for some kind of application (especially object recognition) that I can stop once a certain threshold is reached. At the point where the application stops, a certain result must already be present. It would be nice if anyone has an idea. Feel free to ask if you have any further questions. submitted by /u/mxsrv [link] [comments]  ( 1 min )
    Leaking private data from gradients [D]
    In "Communication-efficient learning of deep networks from decentralized data" (https://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf) McMahan et al write the following (in footnote 1 on page 2): "If the update is the total gradient of the loss onall of the local data, and the features are a sparse bag-of-words,then the non-zero gradients reveal exactly which words the userhas entered on the device.". For context, this paper assumes a set of clients have local private data and a global, joint model over all that data is to be learnt without collecting it to a central server. Clients compute gradients on local data in rounds, apply them and share updated weights with a central server. The footnote above refers to gradients, which are usually shared in distributed optimization, as opposed to parameters. The central server usually averages the gradients before a gradient descent step in that setting. I tried figuring out how the gradient of the loss with respect to one private, local client dataset is supposed to leak the text that user has typed (given bag-of-words features used in model), but haven't quite managed. There is neither proof nor citation in the paper, so it must be something obvious. submitted by /u/lemlo100 [link] [comments]  ( 2 min )
    [D] Open course or study material on signal processing for machine learning
    Hello, my colleagues and I are looking for courses or study material on speech processing for machine learning (signal processing, text-to-speech, etc.). It seems to us that it is harder to find than NLP or computer vision courses on YouTube and Coursera. Do any of you know of good lectures, slides, or study material related to signals? It can be anything from the basics of signal processing to applications in machine learning. Any help will be appreciated! Thank you. submitted by /u/HotRecognition0121 [link] [comments]  ( 1 min )
    [D] Do you use NLTK or Spacy for text preprocessing?
    Nowadays people just pass their text through the default vectorizers pretrained along with BERT or RoBERTa from HuggingFace. But if you trained your own model, would you use spaCy or NLTK? NLTK seems kind of outdated. Also, do pretrained HF tokenizers require some cleaning of the text beforehand (eliminating numbers, hashtags, etc.)? I.e., is there any need for the libraries in the title in this case? submitted by /u/Icy_Fisherman7187 [link] [comments]  ( 2 min )
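    One way to answer the cleaning question is simply to inspect what a pretrained subword tokenizer does with raw text; it copes with numbers, hashtags and punctuation on its own, so heavy NLTK/spaCy-style cleaning is usually unnecessary. A quick check, assuming the transformers package:

        from transformers import AutoTokenizer

        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        # hashtags, digits and punctuation are split into known subwords,
        # so no manual stripping is required beforehand
        print(tok.tokenize("Check out #MLnews: 3 new models!!"))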
    [D] Bachelor Thesis: Research question
    Hello dear ML enthusiasts, I turn to you as a newbie in the ML field, hoping to get inspiration for the formulation of my research question. I am interested in the great potential of ML (especially neural networks), and therefore I would like to write my thesis in this area to gain first insights into this revolutionary technology. In Germany, it is common for a bachelor thesis to conduct a literature analysis, examining individual concepts, comparing them, and determining the state of the art. I have read about 20 papers and keep getting lost down a rabbit hole that is way beyond my expertise. I would be very grateful for any keywords/concepts/ideas that are interesting and that I could look up for further research. Thanks in advance :)) submitted by /u/Hundenberg [link] [comments]  ( 1 min )
  • Open

    Sequence length in LSTM
    In this PPO-LSTM architecture, at some point there is a sequence length variable (https://github.com/MarcoMeter/recurrent-ppo-truncated-bptt/blob/9206a97b7546ec62e668eaf67ae6d4b752e0f0ee/model.py#L79). If you look at the configs, it is always set to 8 in each environment (https://github.com/MarcoMeter/recurrent-ppo-truncated-bptt/blob/9206a97b7546ec62e668eaf67ae6d4b752e0f0ee/configs.py#L15). It is described as "length of the fed sequences". Can someone give an example to make it clearer what this refers to? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    Preprocessing layers applied to an observation space that contains direction and position of the agent as well as a global observation
    I'm trying to figure out what this code is doing, but I have a hard time understanding the observation space here. The observation space is passed as an input, then there are some preprocessing layers (built as a dictionary) in which:
    - The key "policy_state" (which is the hidden state of the LSTM that is controlling the policy) is associated with a few linear layers
    - The key "image" is associated with conv nets
    - The key "direction" (which is assigned to scalar_name) is associated with a fully connected layer
    - The key "position" is associated with a single fully connected layer as well
    What I don't understand is: how do you get such an observation space, containing the position and direction of the agent as well as an observation of the environment?

        ## Identity pass-through for the LSTM policy state
        preprocessing_layers = {
            "policy_state": torch.nn.Identity()
        }

        ## A convolution layer processes the image
        preprocessing_layers["image"] = torch.nn.Sequential(
            torch.nn.Conv2d(obs_space["image"].shape[0], conv_filters, 1),
            torch.nn.ReLU(),
        )

        ## The scalar inputs are processed with a single fully connected layer of size 5
        ## Direction that the agent is facing
        if scalar_name in obs_space:
            preprocessing_layers[scalar_name] = torch.nn.Sequential(
                utils.one_hot_layer(scalar_dim),        # project-specific helper
                torch.nn.Linear(scalar_dim, scalar_fc),
            )

        ## Position of the agent
        if "position" in obs_space:
            preprocessing_layers["position"] = torch.nn.Sequential(
                torch.nn.Linear(2, scalar_fc),          # assumes a 2-D (x, y) position
            )

    submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    Question about Multi agent reinforcement learning
    Is it possible that two agents will become smarter by competing with each other in an environment, given that they know nothing about the environment at the beginning? submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    research on object manipulation
    Can someone please tell me what the progress is in research on object manipulation using a robotic arm? Is it worth starting research on this topic, and is it a worthwhile topic from a future-oriented point of view? submitted by /u/Western-Age3148 [link] [comments]
    Q-Learning on OpenAI Gym Never Gets Reward
    I am very new to reinforcement learning and Q-learning, and I recently got through a tutorial based on https://gym.openai.com/envs/FrozenLake-v0/. The tutorial link is here: https://deeplizard.com/learn/video/HGeI30uATws. I went through the tutorial and I just could not get what it promised to work. Basically, after doing everything listed, I should be able to make my agent play Frozen Lake with about 70% accuracy. However, all I get is 0% accuracy. Naturally, before asking here, I tried to debug to see what is wrong, and I have some suspects. I see that for each episode, my agent caps out at the iteration limit I defined, thereby receiving no reward:
    Episode 9913 ended after 100 iterations. Rewards = 0.0
    Episode 9914 ended after 100 iterations. Rewards = 0.0
    Episode 9915 ended after 100 iterations. Rewards = 0…  ( 7 min )
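    For reference, a minimal tabular Q-learning loop for this environment (a sketch assuming the classic gym API, where reset() returns the state and step() returns four values). A common cause of the all-zero-rewards symptom is exploration that never decays, or a Q-update that never fires:

        import numpy as np
        import gym

        env = gym.make("FrozenLake-v0")
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        alpha, gamma, eps = 0.1, 0.99, 1.0

        for episode in range(10000):
            state = env.reset()
            done = False
            while not done:
                # epsilon-greedy: explore early, exploit later
                if np.random.rand() < eps:
                    action = env.action_space.sample()
                else:
                    action = np.argmax(Q[state])
                next_state, reward, done, _ = env.step(action)
                Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])
                                             - Q[state, action])
                state = next_state
            eps = max(0.01, eps * 0.999)   # decay exploration over episodes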
    "Concurrent Training of a Control Policy and a State Estimator for Dynamic and Robust Legged Locomotion", Ji et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Gym Custom Environment action_space
    Hi everyone, I am new to RL and I am currently working on a custom environment with gym. I want my action to be either -1 or 1. How do I set the action_space for that? submitted by /u/Select-Blackberry451 [link] [comments]  ( 1 min )
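    One common pattern is to declare spaces.Discrete(2) and decode the two actions to -1/+1 inside step(); alternatively, spaces.Box(low=-1, high=1, shape=(1,)) suits continuous-control algorithms. A minimal sketch under the classic gym API:

        import gym
        import numpy as np
        from gym import spaces

        class SignEnv(gym.Env):
            """Toy env: the agent picks 0 or 1, which step() decodes to -1 or +1."""
            def __init__(self):
                self.action_space = spaces.Discrete(2)
                self.observation_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

            def reset(self):
                return np.zeros(1, dtype=np.float32)

            def step(self, action):
                direction = 2 * action - 1   # 0 -> -1, 1 -> +1
                obs = np.array([direction], dtype=np.float32)
                return obs, 0.0, True, {}

        env = SignEnv()
        print(env.step(env.action_space.sample()))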
  • Open

    Process larger and wider datasets with Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Data Wrangler can simplify your data preparation and feature engineering processes and help you with data selection, cleaning, exploration, and visualization. Data Wrangler has over 300 built-in transforms written in PySpark, […]  ( 5 min )
    Fine-tune transformer language models for linguistic diversity with Hugging Face on Amazon SageMaker
    Approximately 7,000 languages are in use today. Despite attempts in the late 19th century to invent constructed languages such as Volapük or Esperanto, there is no sign of unification. People still choose to create new languages (think about your favorite movie character who speaks Klingon, Dothraki, or Elvish). Today, natural language processing (NLP) examples are […]  ( 13 min )
    Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model
    In recent years, natural language understanding (NLU) has increasingly found business value, fueled by model improvements as well as the scalability and cost-efficiency of cloud-based infrastructure. Specifically, the Transformer deep learning architecture, often implemented in the form of BERT models, has been highly successful, but training, fine-tuning, and optimizing these models has proven to be […]  ( 19 min )
  • Open

    ANN in Keras
    Hello everyone, I am trying to make a Keras neural network model. I have made a custom gym environment, and since I am new to machine learning, I am confused about how many layers to use and which ones. My observation space is: self.observation_space = spaces.Box(low=-1, high=1, shape=(4,6), dtype=np.float16). My action space is a discrete value of 2. I am trying to implement DQN and I get the following error: DQN expects a model that has one dimension for each action, in this case 2. I think the problem is with my observation space and action space dimensions, but I am unsure how to fix it. Any help is deeply appreciated! submitted by /u/Select-Blackberry451 [link] [comments]  ( 1 min )
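    If this is keras-rl's DQNAgent (an assumption based on the error text), the model's final Dense layer must have exactly one unit per action, and the (4, 6) observation needs to be flattened; keras-rl also prepends a window dimension of length 1 to the input shape. A sketch with placeholder layer sizes:

        from tensorflow.keras.models import Sequential
        from tensorflow.keras.layers import Dense, Flatten

        model = Sequential([
            Flatten(input_shape=(1, 4, 6)),  # keras-rl adds a window dimension
            Dense(64, activation="relu"),
            Dense(32, activation="relu"),
            Dense(2, activation="linear"),   # one output per discrete action
        ])
        model.summary()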
    Neural Network to approximate function
    Hello, I am trying to create a neural network to approximate a function (the Rosenbrock function). I want to use the toolbox provided by MATLAB, but I don't know which function to use. Can someone help me? The Rosenbrock function looks like this: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2. Hope someone is willing to help me. submitted by /u/TobiasFred [link] [comments]  ( 1 min )
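    In MATLAB's Deep Learning Toolbox, fitnet (or feedforwardnet) together with train is the usual route for function fitting. For readers without MATLAB, a sketch of the same idea in Python:

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(-2, 2, size=(5000, 2))                  # sample the domain
        y = (1 - X[:, 0])**2 + 100 * (X[:, 1] - X[:, 0]**2)**2  # Rosenbrock values

        net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
        net.fit(X, y)
        print(net.predict([[1.0, 1.0]]))  # true value at the minimum (1, 1) is 0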
    NN from Scratch: #6 Building the model | Kolbenkraft
    submitted by /u/cjmodi306 [link] [comments]
  • Open

    Cloud Document Management Is the Future — Here’s Why
    Organizations today have been using some form of document management for years, whether on paper, computer, or online. While we at Bentech…  ( 2 min )
  • Open

    Escaping the Big Data Paradigm with Compact Transformers
    Highlights  ( 2 min )

  • Open

    Use custom vocabulary in Amazon Lex to enhance speech recognition
    In our daily conversations, we come across new words or terms that we may not know. Perhaps these are related to a new domain that we’re just getting familiar with, and we pick these up as we understand more about the domain. For example, home loan terminology (“curtailment”), shortened words, (“refi”, “comps”), and acronyms (“HELOC”) […]  ( 6 min )
    Predict customer churn with no-code machine learning using Amazon SageMaker Canvas
    Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. But losing customers (also called customer churn) is always a risk, and insights into why customers leave can be just as important for maintaining revenues and profits. Machine learning (ML) can help […]  ( 9 min )
  • Open

    Accuracy of prediction
    Is there any technique for a NN to estimate the accuracy of its own prediction? I do not mean just listing how close to 1 the output cells are in a classification, but actually giving a probability of the result being correct. I'm thinking of predicting time series, but I would like a more general answer. submitted by /u/RexurrectionOfDoom [link] [comments]
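    For classification, one standard route is probability calibration: after calibration, predict_proba approximates the actual probability that the prediction is correct rather than an overconfident score near 1. A sketch with scikit-learn (for time series or regression, prediction intervals from quantile or conformal methods play the analogous role):

        from sklearn.calibration import CalibratedClassifierCV
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # isotonic regression maps raw scores onto empirical frequencies,
        # so the returned probabilities are trustworthy confidence estimates
        clf = CalibratedClassifierCV(RandomForestClassifier(), method="isotonic")
        clf.fit(X_tr, y_tr)
        print(clf.predict_proba(X_te[:3]))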
    Does anyone have links to papers that use neural nets to predict the weather? submitted by /u/20io_anarchist [link] [comments]
    submitted by /u/20io_anarchist [link] [comments]
    CNN applied to EEG data - is it better to feed the network raw data or the power spectral density (PSD)?
    I'm analyzing EEG data using a convolutional neural network. I could feed the network either the raw data (with some preprocessing), or I could estimate the PSD using, for instance, Welch's method, and feed that to the network instead. Both implementations exist in the academic literature, but it's never well explained why one would choose one or the other. I get that using the PSD inherently applies some filtering, and that this might improve feature extraction by the CNN in some cases, but which cases? And why? A ready-made answer would be amazing, but I would already be very grateful for some links to resources that cover this. submitted by /u/Qosarom [link] [comments]  ( 1 min )
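    For anyone wanting to compare the two empirically, the PSD variant is nearly a one-liner; a sketch with SciPy, where the sampling rate and channel count are assumptions:

        import numpy as np
        from scipy.signal import welch

        fs = 250                            # sampling rate in Hz (assumed)
        eeg = np.random.randn(8, 10 * fs)   # 8 channels, 10 s of synthetic EEG

        # Welch's method: averaged periodograms over 2 s windows, per channel
        freqs, psd = welch(eeg, fs=fs, nperseg=2 * fs)
        print(psd.shape)  # (channels, frequencies); feed this to the CNN
                          # in place of the raw (channels, samples) array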
    NN/DL applications in defense
    Does anyone know how CNNs are used in defense applications? submitted by /u/Actual-Performer-832 [link] [comments]
    Architecture suggestion for multi-class classification task on images with captions
    I am currently starting on a multi-class classification task on images with small descriptions of the images. We are given training data of images, their captions, and labels, and we need to construct an architecture to classify the label of a new image with a caption. I have done some research online, and it was suggested that the best architecture for a multi-class classification task on images would be a combination of a CNN and an RNN, but I couldn't find anything on how to use the captions together with the images for training. Any advice on where I should start? submitted by /u/anzhuoxianshen [link] [comments]  ( 1 min )
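    A common starting point is a two-branch network: a CNN encodes the image, an RNN encodes the caption tokens, and the two feature vectors are concatenated before the classification head. A PyTorch sketch where all sizes (vocabulary, dimensions, class count) are placeholders:

        import torch
        import torch.nn as nn

        class ImageCaptionClassifier(nn.Module):
            def __init__(self, vocab_size, n_classes):
                super().__init__()
                self.cnn = nn.Sequential(                       # image branch
                    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (N, 64)
                )
                self.embed = nn.Embedding(vocab_size, 128)      # caption branch
                self.rnn = nn.LSTM(128, 64, batch_first=True)
                self.head = nn.Linear(64 + 64, n_classes)       # fused classifier

            def forward(self, image, caption_ids):
                img_feat = self.cnn(image)
                _, (h, _) = self.rnn(self.embed(caption_ids))
                return self.head(torch.cat([img_feat, h[-1]], dim=1))

        model = ImageCaptionClassifier(vocab_size=10000, n_classes=5)
        logits = model(torch.randn(2, 3, 64, 64),            # 2 images
                       torch.randint(0, 10000, (2, 12)))     # 2 captions, 12 tokens
        print(logits.shape)  # -> torch.Size([2, 5])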
  • Open

    How do you initialize a recurrent layer if you don't know the exact size of the input yet?
    Let's say I have a function that processes my observation in a way that makes it difficult to predict the size of the input to the subsequent recurrent layer. In the init method, how do I initialize such a recurrent layer? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
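    Assuming PyTorch: there is no lazy LSTM module (nn.LazyLinear and friends only cover linear, conv and norm layers), but a common workaround is to create the layer on the first forward pass, once the input size is known; the caveat is that the optimizer must be constructed after that first call so the new parameters are registered with it. A sketch:

        import torch
        import torch.nn as nn

        class LazyRecurrent(nn.Module):
            def __init__(self, hidden_size):
                super().__init__()
                self.hidden_size = hidden_size
                self.lstm = None                 # created lazily on first forward

            def forward(self, x):                # x: (batch, time, features)
                if self.lstm is None:
                    self.lstm = nn.LSTM(x.shape[-1], self.hidden_size,
                                        batch_first=True)
                out, _ = self.lstm(x)
                return out

        layer = LazyRecurrent(hidden_size=32)
        out = layer(torch.randn(4, 10, 7))       # input size 7 discovered here
        print(out.shape)                         # -> torch.Size([4, 10, 32])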
    PongNoFrameskip agent learns to do nothing
    Hello! Currently I'm working on a DQN agent for the PongNoFrameskip-v0/4 Atari game. The network itself is really simple; it's the same as in almost all of the tutorials (3 Conv+ReLU and 2 FC(512)). I followed several tutorials and created my own solution for the problem, but for some reason my agent only learns to do nothing after 200 episodes (~190k steps). The tutorial codes work well (with the same hyperparameters and frame processing as my agent); their agent starts to learn something after 120k episodes, and I can't figure out what I am doing wrong. Maybe it is a small thing, I just can't see it. Can anyone who is more advanced in the topic help me with it? I can provide the code; it can run on Colab. submitted by /u/Drotos7 [link] [comments]  ( 1 min )
    queries regarding doubts on research papers
    Can someone suggest a platform where I can ask questions about a particular paper on deep reinforcement learning? submitted by /u/Western-Age3148 [link] [comments]
    What happens if you don't mask the hidden states of a recurrent policy?
    What happens if you don't reset the hidden states to zero when the environment is done during training? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    in MARL, does sharing a policy mean sharing the action space?
    I am a bit confused. In this repository (https://github.com/zoeyuchao/mappo), the authors claim that "by default all experiments assume a shared policy by all agents, i.e. there is one neural net shared by all agents." However, in the code I see that a) there is the option of sharing the observation space, and b) there is no option of sharing the action space. So what does that sentence mean exactly? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    input to the target networks in actor critic algorithms
    What would be the input to the target networks when using them for object manipulation with a robotic arm, and how would this input differ from that of the main actor-critic networks? submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    MADDPG: model does not perform well enough
    I'm using the MADDPG algorithm, trying to control traffic signals in an urban network. I expected it would perform better than traditional methods (e.g., as solved by Synchro), but it failed. Also, I found that after training for 1000 episodes, it improved little over the initialized network's output without noise. I suspect my model did not learn anything, but I have no idea how to solve this problem. Here are the figures for the average episodic reward and each agent's critic loss. The losses dropped rapidly in the first 200 episodes and then were smooth until the end. The average reward converged gradually as time went on, but maybe that is because the noise gradually goes down, not because the model successfully learned a better policy. [Figures: each agent's critic loss; episodic reward] I've uploaded my code to my GitHub, and I hope someone can take a look and give me some directions to modify my code properly. submitted by /u/ntuce002 [link] [comments]  ( 1 min )
    Advice regarding Simulation Environment for Master Thesis
    Hi all, I recently started my Master Thesis, in which I have to implement different RL agents to control a multi-axis machine that fits puzzle pieces into their designated positions, and evaluate the different performances over the next 22 weeks. I was given a simulation environment in Unity which is by far insufficient to reflect the complexity and physics of the task, hence I will have to either rebuild a simulation environment from scratch or adjust the given one. In recent weeks I familiarized myself a bit with Unity & MuJoCo, and I now have to decide between:
    1. Working with & adjusting the existing Unity env
    2. Building a new one in MuJoCo (or maybe PyBullet)
    3. Working with & adjusting the existing Unity env + using the MuJoCo Unity plug-in for the physics engine
    To be clear, I am familiar with the deeper theory & math of the different RL algorithms (Q-learning, policy gradients, SAC), but I do not have any practical implementation experience with them yet. My concern with building the environment completely from scratch is losing too much time on that, as my thesis is about the RL approaches and not the simulation environment; on the other hand, it might pay off to invest that time to have a solid environment to train the agents on. Do you guys have any advice, or experience with a similar situation? Cheers submitted by /u/disdisinform [link] [comments]  ( 2 min )
    Will a CNN be useful for training an agent to play Connect 4? I am thinking about implementing a CNN without pooling, because I don't want to lose any information the agent could learn from. Is a 4x4 kernel a good start? (new to CNNs)
    submitted by /u/Professional_Card176 [link] [comments]  ( 2 min )
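    For scale: a Connect 4 board is only 6x7, so a no-pooling network is cheap and reasonable, and a 4x4 kernel conveniently matches the winning-line length. A PyTorch sketch with placeholder layer sizes:

        import torch
        import torch.nn as nn

        class Connect4Net(nn.Module):
            def __init__(self):
                super().__init__()
                # 4x4 kernels match the length of a winning line; no pooling,
                # so no positional information is discarded on the tiny board
                self.features = nn.Sequential(
                    nn.Conv2d(2, 32, kernel_size=4, padding=2), nn.ReLU(),
                    nn.Conv2d(32, 64, kernel_size=4), nn.ReLU(),
                    nn.Flatten(),
                )
                self.head = nn.Linear(64 * 4 * 5, 7)  # one logit per column

            def forward(self, board):                 # board: (N, 2, 6, 7)
                return self.head(self.features(board))

        x = torch.zeros(1, 2, 6, 7)    # planes: own pieces / opponent pieces
        print(Connect4Net()(x).shape)  # -> torch.Size([1, 7])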
    What does the big E (expected value?) in Bellman's equation really mean?
    submitted by /u/Spencerbug [link] [comments]  ( 1 min )
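    For reference, in the standard notation (e.g., Sutton & Barto) the $\mathbb{E}_\pi$ averages over the policy's action choice and the environment's transition; for a finite MDP it unpacks into plain weighted sums:

    $$ v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr] $$

    In words: weight every possible action by how likely the policy is to pick it, weight every successor state and reward by how likely the environment is to produce them, and sum.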
  • Open

    [P] Can anyone suggest a free image annotation tool for multi-labelling?
    I am annotating handwritten texts. By multi-label I mean that I am storing multiple pieces of information with a single bounded component. Let's say I am classifying a text as "clean text" or "unclean text"; then, for clean text, the content of the text, the language, and whether or not it is a math expression are stored. I have been using Plainsight, and its sublabel feature was doing fine, but I just noticed that it is paid after a month of free use. I am curious whether LabelMe does this, as the tutorial does not explicitly say that a single component can have multiple pieces of information or classes associated with it. submitted by /u/slowturtle56 [link] [comments]  ( 1 min )
    [D] What venues accept high quality surveys?
    I am specifically asking about high-profile applications within the computer vision or deep learning space. I am just curious about what sort of venue would be appropriate for a survey. submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [D] This Ape Does Not Exist! I trained a StyleGAN2 on the Bored Ape Yacht Club NFT Collection (YouTube Video)
    https://youtu.be/Pm93D8CVlY8 Today we build our own AI that can create as many bored apes as we want! Fungibility for everyone!
    OUTLINE:
    0:00 - Introduction
    2:05 - Generative Adversarial Networks
    3:40 - Scraping Opensea with BrightData
    7:55 - Training the GAN
    11:35 - Here are the results!
    15:20 - Diving deeper into BrightData
    Try the model here: https://huggingface.co/spaces/ykilcher/apes or here: https://ykilcher.com/apes Files & Models here: https://huggingface.co/ykilcher/apes/tree/main Code here: https://github.com/yk/apes-public submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [P] Start-up puts an end to hassle around vehicle damage inspections
    Lensor uses cameras and machine learning to inspect a car in 7 seconds, taking over 200 photos. These are checked by a system that was trained on a huge collection of photos of damage. In the future this will replace walking around the car at the rental station and filling in forms. submitted by /u/DutchTechJunkie [link] [comments]  ( 1 min )
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    Paper: https://arxiv.org/abs/2205.01917 Impressive performance on diverse datasets may indicate higher generalizability :) [without "task-specific" customizations]. Confirms that multi-modal models can scale beyond single-digit billions of params (who would've thought) and scales up a simple CLIP-like model, showing substantial improvements - especially in the 0-shot domain. Simple contrastive learning appears more and more promising for multi-modal objectives... Overall, it's nothing novel - simply scaled-up research. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 2 min )
    [R] ExSum: From Local Explanations to Model Understanding
    Excited to share our latest research on model interpretability, to appear at NAACL this summer. In this paper, we reflect on local model explanations (e.g. LIME, SHAP, gradient saliency) and think about how people actually use them to derive high-level model understanding (e.g. is the model relying on spurious correlation, is it biased, can I trust it). Obviously, explanations need to be correct (or faithful), which has been the focus of many interpretability evaluations. However, we argue that they also need to be understandable, and propose explanation summary (ExSum), the first mathematical framework to quantify this understandability aspect. Using it, we demonstrate that a formal practice of quantifying model understanding leads to better and broader awareness of subtle model behaviors that would easily be missed if we were just to inspect a few local explanations in an ad hoc way. Happy to answer any questions. Paper: https://arxiv.org/pdf/2205.00130.pdf Project website: https://yilunzhou.github.io/exsum/ MIT News release: https://news.mit.edu/2022/machine-learning-explainability-0505 submitted by /u/zyl1024 [link] [comments]  ( 2 min )
    [D] what is a good model choice for detection of small numbers in a picture (followed by classification)
    I want to detect jersey numbers on the backs of soccer players. My current model uses an STN, but I am afraid it is not the best approach. I know that there are plenty of different object detection models, but I don't know enough to make a choice. My data has the bounding box coordinates and the correct number, so I could provide double supervision. submitted by /u/TheManveru [link] [comments]  ( 1 min )
    [D] Why is NLP given so much attention and resources?
    I don't really see any models, besides DeepMind's AlphaGo, MuZero, etc., gain the attention or resources that NLP projects like GPT-3 and others get, so I am curious as to why that is, and why there aren't many other model "types" with as many parameters besides NLP ones. Is it due to clearer applications and more probable ROI? Is there a view that NLP models like these have a higher chance of leading to more general intelligence, such as the use of these models in code generation and other problem-solving tasks? submitted by /u/Southern-Trip-1102 [link] [comments]  ( 4 min )
    [D] Scaling in Multivariate/Multi-output regression
    Hello, I have been trying to improve the accuracy of my model. Model: the input is a set of measurements (one set in mm and another set in degrees); the output is a set of predictions (again, one set in mm and another set in degrees). Now, as you can imagine, there are times when the data has completely different scales. I wanted to know what effect scaling the input data has on the outcome. I'm using sklearn's MultiOutputRegressor on top of lgbm.LGBMRegressor. Also, is there a better way to score my model? The default R² score is fine, but when I try to pass a custom callback to model.fit() with the keyword eval_metric, it isn't being used. submitted by /u/translunarinjection [link] [comments]  ( 1 min )
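    Worth noting: gradient-boosted trees split on thresholds, so they are invariant to monotonic rescaling of the inputs; scaling should barely change LGBMRegressor's predictions, though it matters if a different base model is swapped in, and scaling the targets puts the mm and degree outputs on a comparable footing. A sketch that wires scalers in anyway, assuming scikit-learn and lightgbm:

        import numpy as np
        from lightgbm import LGBMRegressor
        from sklearn.compose import TransformedTargetRegressor
        from sklearn.multioutput import MultiOutputRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 6))
        y = np.c_[X[:, 0] * 100.0, X[:, 1] * 0.1]  # one target in "mm", one in "degrees"

        model = TransformedTargetRegressor(
            regressor=make_pipeline(
                StandardScaler(),                   # harmless for trees, helps other models
                MultiOutputRegressor(LGBMRegressor()),
            ),
            transformer=StandardScaler(),           # scales targets, inverted on predict
        )
        model.fit(X, y)
        print(model.score(X, y))                    # R^2, averaged over the outputs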
    [D] Help: Finding a paper about self-supervised pretraining being limited on some datasets and not yielding good results on others
    Is there any literature explaining why self-supervised pretraining sometimes doesn't work on certain datasets? I have used SwAV and SimCLR, but they are not performing better and sometimes perform worse. submitted by /u/sarmientoj24 [link] [comments]  ( 1 min )
    GAN Literature Review Recommendations? [D]
    Hi, I'm looking to do a lit review of GANs before starting a project - the only problem is that I've lost touch with developments in the last 2-3 years. Does anyone have any papers or key advancements they could recommend to me from within the last 3 years or so? Would appreciate any input. Thanks submitted by /u/blahbloopooo [link] [comments]  ( 1 min )
    [D] What is in your perfect dev environment?
    Regardless of where you're running, what are the must have tools in your ML arsenal (and perhaps why you need it so much)? This is the stuff that you would want pre-installed on whatever environment you'd want to use, whether that's a container, VM, or bare metal server. The reason I ask is because I'm looking for ways to make the default run image on RunPod a bit more friendly. Or maybe it makes sense to have different environments for different major use cases? Is it a fools errand to try to cram everything into one, or even a few, buckets? Let me know in the comments! TIA :) submitted by /u/runpod-io [link] [comments]  ( 2 min )
  • Open

    OpenAI Leadership Team Update
    We’re happy to announce several executive role changes that reflect our recent progress and will ensure continued momentum toward our next major milestones.  ( 1 min )
  • Open

    Last Week in AI: AI Kills Cookie Pop-Ups, Models Volcanoes, Screens for Child Neglect, Paints Harry Potter
    submitted by /u/regalalgorithm [link] [comments]
    Iterative Introduces New ML Tool TPI Plugin For HashiCorp's Terraform Cloud Infrastructure Service
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Exciting Data Science Project Ideas To Brush Up Your Skills
    Understanding Data Science can be quite confusing at first, but with constant practice, you can soon begin to grasp the various notions and terminologies in the subject. The best way to gain more exposure to Data Science apart from going through the literature is to take on some helpful projects which will not only upskill you but will also make your resume more impressive. Here are some new-fangled data science project ideas that zealous data science professionals can pick from: https://betterprogramming.pub/exciting-data-science-project-ideas-to-brush-up-your-skills-54475993d413 submitted by /u/saik2363 [link] [comments]  ( 1 min )
    AI in cybersecurity - for good and bad
    Would love to hear any thoughts on AI and ML in cybersecurity, whether helping attackers or defenders - anyone work in CS? I just published a podcast about artificial intelligence and its growing influence on cybersecurity, both as an adversarial tool (deepfakes, social engineering etc.) and as a way to detect anomalies and protect websites etc. The guest is Elaine Lee, a data scientist working at Mimecast to use AI to stop spam and phishing attacks. https://open.spotify.com/episode/2BG4c0qyXJsjvOE5nqifhe?si=a318ca529e164174 submitted by /u/AgentLessBots [link] [comments]  ( 1 min )
    Any free chat bots that have URL endpoints to get their responses?
    submitted by /u/antoniscool28 [link] [comments]  ( 1 min )
    Have any researchers in the field discussed anything about the prospect of 'text-to-video' - something that's a bit like DALL-E 2, but with a video as the finished output?
    I'm wondering if any of the researchers who are in this field have spoken about the idea of (eventually) creating something where you would end up with some kind of short video at the end of it. submitted by /u/brick_eater [link] [comments]  ( 1 min )
    Support for macOS Added in MindSpore 1.6
    submitted by /u/Creative_Habit_6868 [link] [comments]  ( 5 min )
  • Open

    Learning Locomotion Skills Safely in the Real World
    Posted by Jimmy (Tsung-Yen) Yang, Student Researcher, Robotics at Google The promise of deep reinforcement learning (RL) in solving complex, high-dimensional problems autonomously has attracted much interest in areas such as robotics, game playing, and self-driving cars. However, effectively training an RL policy requires exploring a large set of robot states and actions, including many that are not safe for the robot. This is a considerable risk, for example, when training a legged robot. Because such robots are inherently unstable, there is a high likelihood of the robot falling during learning, which could cause damage. The risk of damage can be mitigated to some extent by learning the control policy in computer simulation and then deploying it in the real world. However, this approac…  ( 9 min )
  • Open

    Driver’s Ed: How Waabi Uses AI, Simulation to Teach Autonomous Vehicles to Drive
    Teaching the AI brains of autonomous vehicles to understand the world as humans do requires billions of miles of driving experience. The road to achieving this astronomical level of driving leads to the virtual world. On the latest episode of the AI Podcast, Waabi CEO and founder Raquel Urtasun joins NVIDIA’s Katie Burke Washabaugh to Read article > The post Driver’s Ed: How Waabi Uses AI, Simulation to Teach Autonomous Vehicles to Drive appeared first on NVIDIA Blog.  ( 2 min )
    GFN Thursday Caught in 4K: 27 Games Arriving on GeForce NOW in May, Alongside 4K Streaming to PC and Mac Apps
    Enjoy the finer things in life. May is looking pixel perfect for GeForce NOW gamers. RTX 3080 members can now take their games to the next level, streaming at 4K resolution on the GeForce NOW PC and Mac native apps — joining 4K support in the living room with SHIELD TV. There’s also a list Read article > The post GFN Thursday Caught in 4K: 27 Games Arriving on GeForce NOW in May, Alongside 4K Streaming to PC and Mac Apps appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Setting Breakpoints and Exception Hooks in Python
    There are different ways of debugging code in Python, one of which is to introduce breakpoints into the code at […] The post Setting Breakpoints and Exception Hooks in Python appeared first on Machine Learning Mastery.  ( 14 min )
  • Open

    Azure Quantum innovation: Efficient error correction of topological qubits with Floquet codes
    Technological innovation that enables scaling of quantum computing underpins the Microsoft Azure Quantum program. In March of this year, we announced our demonstration of the underlying physics required to create a topological qubit—qubits that are theorized to be inherently more stable than existing ones without sacrificing size or speed. However, our quest to deliver a […] The post Azure Quantum innovation: Efficient error correction of topological qubits with Floquet codes appeared first on Microsoft Research.  ( 7 min )
  • Open

    Implementing an AI project the right way: Here’s how it works
    Do you want to reduce costs and introduce more efficient workflows in your company? Then you may have thought about using artificial…  ( 5 min )
  • Open

    DSC Weekly Newsletter 03 May 2022: How Many Meetings Do We Need?
    One of the more frustrating side effects of the Long Pandemic has been the rise in virtual meetings. I recently was talking with a colleague when the subject of meetings came up. “I swear that I spend much of most days in meetings,” she bemoaned. “It wouldn’t be so bad, but so many of them… Read More »DSC Weekly Newsletter 03 May 2022: How Many Meetings Do We Need? The post DSC Weekly Newsletter 03 May 2022: How Many Meetings Do We Need? appeared first on Data Science Central.  ( 7 min )
  • Open

    Unpacking black-box models
    Researchers create a mathematical framework to evaluate explanations of machine-learning models and quantify how well people understand them.  ( 6 min )
  • Open

    John Conway and mental exercise rituals
    John Horton Conway (1937–2020) came up with an algorithm in 1973 for mentally calculating what day of the week a date falls on. His method, which he called the “Doomsday rule,” starts from the observation that every year, the dates 4/4, 6/6, 8/8, 10/10, 12/12, 5/9, 9/5, 7/11, and 11/7 fall on the same day […] John Conway and mental exercise rituals first appeared on John D. Cook.  ( 3 min )
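    As a quick illustration of the rule (a sketch, with Sunday = 0):

        def day_of_week(year, month, day):
            """Conway's Doomsday rule; returns 0=Sunday .. 6=Saturday."""
            leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
            # date of the "doomsday" in each month (index 1..12)
            doomsdays = [None, 4 if leap else 3, 29 if leap else 28,
                         14, 4, 9, 6, 11, 8, 5, 10, 7, 12]
            anchor = (5 * ((year // 100) % 4) + 2) % 7   # Gregorian century anchor
            y = year % 100
            dd = (anchor + y + y // 4) % 7               # this year's doomsday
            return (dd + day - doomsdays[month]) % 7

        print(day_of_week(2022, 5, 4))   # -> 3 (May 4, 2022 was a Wednesday)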
  • Open

    The Grammar of Interactive Explanatory Model Analysis. (arXiv:2005.00497v4 [cs.LG] UPDATED)
    The growing need for in-depth analysis of predictive models leads to a series of new methods for explaining their local and global properties. Which of these methods is the best? It turns out that this is an ill-posed question. One cannot sufficiently explain a black-box machine learning model using a single method that gives only one perspective. Isolated explanations are prone to misunderstanding, leading to wrong or simplistic reasoning. This problem is known as the Rashomon effect and refers to diverse, even contradictory, interpretations of the same phenomenon. Surprisingly, most methods developed for explainable and responsible machine learning focus on a single aspect of the model behavior. In contrast, we showcase the problem of explainability as an interactive and sequential analysis of a model. This paper proposes how different Explanatory Model Analysis (EMA) methods complement each other and discusses why it is essential to juxtapose them. The introduced process of Interactive EMA (IEMA) derives from the algorithmic side of explainable machine learning and aims to embrace ideas developed in cognitive sciences. We formalize the grammar of IEMA to describe potential human-model dialogues. It is implemented in a widely used human-centered open-source software framework that adopts interactivity, customizability and automation as its main traits. We conduct a user study to evaluate the usefulness of IEMA, which indicates that an interactive sequential analysis of a model increases the performance and confidence of human decision making.
    Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification. (arXiv:2205.02151v1 [cs.CV])
    Recently, self-attention mechanisms have shown impressive performance in various NLP and CV tasks, which can help capture sequential characteristics and derive global information. In this work, we explore how to extend self-attention modules to better learn subtle feature embeddings for recognizing fine-grained objects, e.g., different bird species or person identities. To this end, we propose a dual cross-attention learning (DCAL) algorithm to coordinate with self-attention learning. First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions, which can help reinforce the spatial-wise discriminative clues for recognition. Second, we propose pair-wise cross-attention (PWCA) to establish the interactions between image pairs. PWCA can regularize the attention learning of an image by treating another image as distractor and will be removed during inference. We observe that DCAL can reduce misleading attentions and diffuse the attention response to discover more complementary parts for recognition. We conduct extensive evaluations on fine-grained visual categorization and object re-identification. Experiments demonstrate that DCAL performs on par with state-of-the-art methods and consistently improves multiple self-attention baselines, e.g., surpassing DeiT-Tiny and ViT-Base by 2.8% and 2.4% mAP on MSMT17, respectively.
    Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra. (arXiv:2202.00450v4 [cs.LG] UPDATED)
    Low-rank approximation of images via singular value decomposition is well-received in the era of big data. However, singular value decomposition (SVD) is only for order-two data, i.e., matrices. It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD. Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components. We consider the problem of generalizing HOSVD over a finite dimensional commutative algebra. This algebra, referred to as a t-algebra, generalizes the field of complex numbers. The elements of the algebra, called t-scalars, are fixed-size arrays of complex numbers. One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions. The generalization of HOSVD is called THOSVD. Its performance of approximating multi-way data can be further improved by an alternating algorithm. THOSVD also unifies a wide range of principal component analysis algorithms. To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to a "deeper-order" t-scalar. Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts.
    Axonal Delay As a Short-Term Memory for Feed Forward Deep Spiking Neural Networks. (arXiv:2205.02115v1 [cs.NE])
    The information in spiking neural networks (SNNs) is propagated between adjacent biological neurons by spikes, which provides a computing paradigm with the promise of simulating the human brain. Recent studies have found that the time delay of neurons plays an important role in the learning process. Therefore, configuring the precise timing of the spike is a promising direction for understanding and improving the transmission process of temporal information in SNNs. However, most of the existing learning methods for spiking neurons focus on the adjustment of synaptic weight, while very little research has been done on axonal delay. In this paper, we verify the effectiveness of integrating time delay into supervised learning and propose a module that modulates the axonal delay through short-term memory. To this end, a rectified axonal delay (RAD) module is integrated with the spiking model to align the spike timing and thus improve the characterization learning ability of temporal features. Experiments on three neuromorphic benchmark datasets: NMNIST, DVS Gesture and N-TIDIGITS18 show that the proposed method achieves state-of-the-art performance while using the fewest parameters.
    Semi-Supervised Cascaded Clustering for Classification of Noisy Label Data. (arXiv:2205.02209v1 [cs.LG])
    The performance of supervised classification techniques often deteriorates when the data has noisy labels. Even the semi-supervised classification approaches have largely focused only on the problem of handling missing labels. Most of the approaches addressing the noisy label data rely on deep neural networks (DNN) that require huge datasets for classification tasks. This poses a serious challenge especially in process and manufacturing industries, where the data is limited and labels are noisy. We propose a semi-supervised cascaded clustering (SSCC) algorithm to extract patterns and generate a cascaded tree of classes in such datasets. A novel cluster evaluation matrix (CEM) with configurable hyperparameters is introduced to localize and eliminate the noisy labels and invoke a pruning criterion on cascaded clustering. The algorithm reduces the dependency on expensive human expertise for assessing the accuracy of labels. A classifier generated based on SSCC is found to be accurate and consistent even when trained on noisy label datasets. It performed better in comparison with the support vector machines (SVM) when tested on multiple noisy-label datasets, including an industrial dataset. The proposed approach can be effectively used for deriving actionable insights in industrial settings with minimal human expertise.
    Learning the temporal evolution of multivariate densities via normalizing flows. (arXiv:2107.13735v2 [stat.ML] UPDATED)
    In this work, we propose a method to learn multivariate probability distributions using sample path data from stochastic differential equations. Specifically, we consider temporally evolving probability distributions (e.g., those produced by integrating local or nonlocal Fokker-Planck equations). We analyze this evolution through machine learning assisted construction of a time-dependent mapping that takes a reference distribution (say, a Gaussian) to each and every instance of our evolving distribution. If the reference distribution is the initial condition of a Fokker-Planck equation, what we learn is the time-T map of the corresponding solution. Specifically, the learned map is a multivariate normalizing flow that deforms the support of the reference density to the support of each and every density snapshot in time. We demonstrate that this approach can approximate probability density function evolutions in time from observed sampled data for systems driven by both Brownian and L\'evy noise. We present examples with two- and three-dimensional, uni- and multimodal distributions to validate the method.
    Deep Reinforcement Learning-Based Long-Range Autonomous Valet Parking for Smart Cities. (arXiv:2109.11661v3 [cs.LG] UPDATED)
    In this paper, to reduce the congestion rate at the city center and increase the quality of experience (QoE) of each user, the framework of long-range autonomous valet parking (LAVP) is presented, where an Autonomous Vehicle (AV) is deployed in the city to pick up and drop off users at their required spots, and then drive to a car park outside the city center autonomously. In this framework, we aim to minimize the overall distance travelled by the AV while guaranteeing that all users are served, i.e., picked up and dropped off at their required spots, by optimizing the path planning of the AV and the number of serving time slots. To this end, we first propose a learning-based algorithm, named the Double-Layer Ant Colony Optimization (DL-ACO) algorithm, to solve the above problem in an iterative way. Then, to make real-time decisions while considering the dynamic environment (i.e., the AV may pick up and drop off users from different locations), we further present a deep reinforcement learning (DRL) based algorithm, known as the deep Q-network (DQN). The experimental results show that the DL-ACO and DQN-based algorithms both achieve considerable performance.
    Microgrid Day-Ahead Scheduling Considering Neural Network based Battery Degradation Model. (arXiv:2202.12416v2 [eess.SP] UPDATED)
    Battery energy storage system (BESS) can effectively mitigate the uncertainty of variable renewable generation. Degradation is unpreventable for batteries such as the most popular Lithium-ion battery (LiB). The main causes of LiB degradation are loss of Li-ions, loss of electrolyte, and increase of internal resistance, which are hard to model and predict. In this paper, we propose a data driven method to predict the battery degradation per a given scheduled battery operational profile. Particularly, a neural network based battery degradation (NNBD) model is proposed to quantify the battery degradation with inputs of major battery degradation factors. When incorporating the proposed NNBD model into microgrid day-ahead scheduling (MDS), we can establish a battery degradation based MDS (BDMDS) model that can consider the equivalent battery degradation cost precisely. Since the proposed NNBD model is highly non-linear and non-convex, BDMDS would be very hard to solve. To address this issue, a neural network and optimization decoupled heuristic (NNODH) algorithm is proposed in this paper to effectively solve this neural network embedded optimization problem. Simulation results demonstrate that the proposed NNODH algorithm is able to obtain the optimal solution with lowest total cost including normal operation cost and battery degradation cost.
    Sequencer: Deep LSTM for Image Classification. (arXiv:2205.01972v1 [cs.CV])
    In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6\% top-1 accuracy on only ImageNet-1K. Not only that, we show that it has good transferability and robust resolution adaptability on a doubled resolution band.
    Compound virtual screening by learning-to-rank with gradient boosting decision tree and enrichment-based cumulative gain. (arXiv:2205.02169v1 [q-bio.BM])
    Learning-to-rank, a machine learning technique widely used in information retrieval, has recently been applied to the problem of ligand-based virtual screening, to accelerate the early stages of new drug development. Ranking prediction models learn based on ordinal relationships, making them suitable for integrating assay data from various environments. Existing studies of rank prediction in compound screening have generally used a learning-to-rank method called RankSVM. However, they have not been compared with or validated against the gradient boosting decision tree (GBDT)-based learning-to-rank methods that have gained popularity recently. Furthermore, although the ranking metric called Normalized Discounted Cumulative Gain (NDCG) is widely used in information retrieval, it only determines whether the predictions are better than those of other models. In other words, NDCG is incapable of recognizing when a prediction model produces worse than random results. Nevertheless, NDCG is still used in the performance evaluation of compound screening using learning-to-rank. This study used the GBDT model with ranking loss functions, called lambdarank and lambdaloss, for ligand-based virtual screening; results were compared with existing RankSVM methods and GBDT models using regression. We also proposed a new ranking metric, Normalized Enrichment Discounted Cumulative Gain (NEDCG), which aims to properly evaluate the goodness of ranking predictions. Results showed that the GBDT model with learning-to-rank outperformed existing regression methods using GBDT and RankSVM on diverse datasets. Moreover, NEDCG showed that predictions by regression were comparable to random predictions in multi-assay, multi-family datasets, demonstrating its usefulness for a more direct assessment of compound screening performance.
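    The standard NDCG that the authors critique can be computed as below; the final lines sketch one way an enrichment-style rescaling against a random ranking might look. This rescaling is only a guess at the spirit of NEDCG, whose exact definition is not given in the abstract.

```python
import numpy as np

def dcg(relevance):
    """Discounted cumulative gain of a relevance vector already in ranked order."""
    ranks = np.arange(1, len(relevance) + 1)
    return np.sum(relevance / np.log2(ranks + 1))

def ndcg(scores, relevance):
    """Standard NDCG: DCG of the predicted ranking over DCG of the ideal one."""
    return dcg(relevance[np.argsort(-scores)]) / dcg(np.sort(relevance)[::-1])

rng = np.random.default_rng(0)
scores = np.array([0.9, 0.2, 0.5, 0.7])      # hypothetical model scores
relevance = np.array([1.0, 0.0, 1.0, 0.0])   # 1 = active compound

# NDCG alone does not reveal whether a ranking beats chance. A hypothetical
# enrichment-style rescaling anchors the score at the expected DCG of a
# random ranking (0 = random, 1 = ideal); the paper's NEDCG may differ.
ideal = dcg(np.sort(relevance)[::-1])
random_dcg = np.mean([dcg(rng.permutation(relevance)) for _ in range(10_000)])
pred = dcg(relevance[np.argsort(-scores)])
nedcg_guess = (pred - random_dcg) / (ideal - random_dcg)
print(ndcg(scores, relevance), nedcg_guess)
```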
    Learning Mechanically Driven Emergent Behavior with Message Passing Neural Networks. (arXiv:2202.01380v2 [cs.LG] UPDATED)
    From designing architected materials to connecting mechanical behavior across scales, computational modeling is a critical tool in solid mechanics. Recently, there has been a growing interest in using machine learning to reduce the computational cost of physics-based simulations. Notably, while machine learning approaches that rely on Graph Neural Networks (GNNs) have shown success in learning mechanics, the performance of GNNs has yet to be investigated on a myriad of solid mechanics problems. In this work, we examine the ability of GNNs to predict a fundamental aspect of mechanically driven emergent behavior: the connection between a column's geometric structure and the direction that it buckles. To accomplish this, we introduce the Asymmetric Buckling Columns (ABC) dataset, a dataset comprised of three sub-datasets of asymmetric and heterogeneous column geometries where the goal is to classify the direction of symmetry breaking (left or right) under compression after the onset of instability. Because of complex local geometry, the "image-like" data representations required for implementing standard convolutional neural network based metamodels are not ideal, thus motivating the use of GNNs. In addition to investigating GNN model architecture, we study the effect of different input data representation approaches, data augmentation, and combining multiple models as an ensemble. While we were able to obtain good results, we also showed that predicting solid mechanics based emergent behavior is non-trivial. Because both our model implementation and dataset are distributed under open-source licenses, we hope that future researchers can build on our work to create enhanced mechanics-specific machine learning pipelines for capturing the behavior of complex geometric structures.
    Analysis of Temporal Difference Learning: Linear System Approach. (arXiv:2204.10479v3 [cs.LG] UPDATED)
    The goal of this technical note is to introduce a new finite-time convergence analysis of temporal difference (TD) learning based on stochastic linear system models. TD-learning is a fundamental reinforcement learning (RL) method for evaluating a given policy by estimating the corresponding value function of a Markov decision process. While there has been a series of successful works on the theoretical analysis of TD-learning, it was not until recently that researchers found guarantees on its statistical efficiency by developing finite-time error bounds. In this paper, we propose a simple control-theoretic finite-time analysis of TD-learning, which exploits linear system models and standard notions from the linear systems community. The proposed work provides simple new templates for RL analysis, and additional insights on TD-learning and RL based on ideas from control theory.
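    For reference, the TD(0) iteration that such analyses study is a stochastic linear recursion in the weight vector. A minimal sketch with linear function approximation, where the feature map, step size, and MDP are all toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 5, 3
phi = rng.standard_normal((n_states, d))   # fixed feature map, one row per state
theta = np.zeros(d)                        # value-function weights: V(s) = phi[s] @ theta
alpha, gamma = 0.1, 0.9

for _ in range(10_000):
    s = rng.integers(n_states)             # i.i.d. state sampling for simplicity
    s_next = rng.integers(n_states)        # toy fixed-policy transition
    r = 1.0 if s_next == 0 else 0.0        # toy reward
    # TD(0): theta <- theta + alpha * (r + gamma*V(s') - V(s)) * grad V(s),
    # which is linear in theta -- hence the stochastic linear system view.
    td_error = r + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * td_error * phi[s]

print(theta)   # converges in expectation to the fixed point of a linear system
```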
    Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. (arXiv:2201.05989v2 [cs.CV] UPDATED)
    Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. We leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations. We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of ${1920\!\times\!1080}$.
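    A stripped-down sketch of the hash-encoding lookup for a single resolution level in 2D: grid corners are spatially hashed into a feature table and the corner features are bilinearly interpolated. The per-dimension primes follow the paper's spatial hash, but the table size, feature width, and 2D setting are simplified toy choices.

```python
import numpy as np

T, F = 2**14, 2            # hash-table size and feature width (toy values)
table = np.random.default_rng(0).standard_normal((T, F)).astype(np.float32)
PRIMES = np.array([1, 2654435761], dtype=np.uint64)   # per-dimension primes

def hash_corner(ix, iy):
    """Spatial hash of an integer grid corner: XOR of coordinate*prime, mod T."""
    return int((np.uint64(ix) * PRIMES[0]) ^ (np.uint64(iy) * PRIMES[1])) % T

def encode(x, y, resolution=64):
    """Bilinearly interpolate hashed corner features for a point in [0,1]^2."""
    gx, gy = x * resolution, y * resolution
    x0, y0 = int(gx), int(gy)
    fx, fy = gx - x0, gy - y0
    f00 = table[hash_corner(x0,     y0)]
    f10 = table[hash_corner(x0 + 1, y0)]
    f01 = table[hash_corner(x0,     y0 + 1)]
    f11 = table[hash_corner(x0 + 1, y0 + 1)]
    return (1-fx)*(1-fy)*f00 + fx*(1-fy)*f10 + (1-fx)*fy*f01 + fx*fy*f11

print(encode(0.3, 0.7))    # 2-dim feature; the paper concatenates across levels
```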
    Modeling Task Interactions in Document-Level Joint Entity and Relation Extraction. (arXiv:2205.01909v1 [cs.CL])
    We target document-level relation extraction in an end-to-end setting, where the model needs to jointly perform mention extraction, coreference resolution (COREF) and relation extraction (RE) at once, and is evaluated in an entity-centric way. In particular, we address the two-way interaction between COREF and RE that has not been a focus of previous work, and propose to introduce an explicit interaction, namely Graph Compatibility (GC), that is specifically designed to leverage task characteristics, bridging the decisions of the two tasks for direct task interaction. Our experiments are conducted on DocRED and DWIE; in addition to GC, we implement and compare different multi-task settings commonly adopted in previous work, including pipelines, shared encoders, and graph propagation, to examine the effectiveness of different interactions. The results show that GC achieves the best performance, with up to a 2.3/5.1 F1 improvement over the baseline.
    Assessing Dataset Bias in Computer Vision. (arXiv:2205.01811v1 [cs.CV])
    A biased dataset is a dataset that generally has attributes with an uneven class distribution. These biases tend to propagate to the models trained on them, often leading to poor performance in the minority class. In this project, we explore the extent to which various data augmentation methods alleviate intrinsic biases within a dataset. We apply several augmentation techniques on a sample of the UTKFace dataset, such as undersampling, geometric transformations, variational autoencoders (VAEs), and generative adversarial networks (GANs). We then trained a classifier for each of the augmented datasets and evaluated their performance on the native test set and on external facial recognition datasets. We also compared their performance to the state-of-the-art attribute classifier trained on the FairFace dataset. Through experimentation, we found that training the model on StarGAN-generated images led to the best overall performance. We also found that training on geometrically transformed images led to similar performance with a much quicker training time. Additionally, the best performing models also exhibit uniform performance across the classes within each attribute. This signifies that these models were also able to mitigate the biases present in the baseline model trained on the original training set. Finally, we show that our model has better overall performance and consistency on age and ethnicity classification on multiple datasets when compared with the FairFace model. Our final model achieves accuracies on the UTKFace test set of 91.75%, 91.30%, and 87.20% for the gender, age, and ethnicity attributes respectively, with a standard deviation of less than 0.1 between the accuracies of the classes of each attribute.
    Signal Decomposition Using Masked Proximal Operators. (arXiv:2202.09338v4 [cs.LG] UPDATED)
    We consider the well-studied problem of decomposing a vector time series signal into components with different characteristics, such as smooth, periodic, nonnegative, or sparse. We propose a simple and general framework in which the components are defined by loss functions (which include constraints), and the signal decomposition is carried out by minimizing the sum of losses of the components (subject to the constraints). When each loss function is the negative log-likelihood of a density for the signal component, our method coincides with maximum a posteriori probability (MAP) estimation; but it also includes many other interesting cases. We give two distributed optimization methods for computing the decomposition, which find the optimal decomposition when the component class loss functions are convex, and are good heuristics when they are not. Both methods require only the masked proximal operator of each of the component loss functions, a generalization of the well-known proximal operator that handles missing entries in its argument. Both methods are distributed, i.e., handle each component separately. We derive tractable methods for evaluating the masked proximal operators of some loss functions that, to our knowledge, have not appeared in the literature.
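    As a concrete instance of the key primitive, here is a masked proximal operator for the scaled sum-of-squares loss, worked out in closed form; the loss choice and interface are illustrative, not the paper's code.

```python
import numpy as np

def masked_prox_sum_squares(v, known, rho=1.0):
    """Masked proximal operator of l(x) = 0.5*||x||^2:
    argmin_x 0.5*||x||^2 + (rho/2) * sum_{i known} (x_i - v_i)^2.
    Missing entries of v are simply ignored; there the loss alone decides x_i.
    Closed form: x_i = rho*v_i/(1+rho) where known, x_i = 0 where missing."""
    x = np.zeros_like(v, dtype=float)
    x[known] = rho * v[known] / (1.0 + rho)
    return x

v = np.array([2.0, np.nan, -1.0, np.nan])   # nan marks missing entries
known = ~np.isnan(v)
print(masked_prox_sum_squares(v, known))    # [ 1.   0.  -0.5  0. ]
```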
    TracInAD: Measuring Influence for Anomaly Detection. (arXiv:2205.01362v2 [cs.LG] UPDATED)
    As with many other tasks, neural networks prove very effective for anomaly detection purposes. However, very few deep-learning models are suited to detecting anomalies in tabular datasets. This paper proposes a novel methodology to flag anomalies based on TracIn, an influence measure initially introduced for explainability purposes. The proposed method can serve to augment any unsupervised deep anomaly detection method. We test our approach using Variational Autoencoders and show that the average influence of a subsample of training points on a test point can serve as a proxy for abnormality. Our model proves to be competitive with state-of-the-art approaches: it achieves comparable or better performance in terms of detection accuracy on medical and cyber-security tabular benchmark data.
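    A minimal sketch of a TracIn-style score used as an abnormality proxy, assuming a toy PyTorch model and a stand-in unsupervised loss; the checkpoint set, loss, and subsampling only loosely follow the idea.

```python
import torch
import torch.nn as nn

def flat_grad(model, loss):
    """Flatten the gradient of `loss` w.r.t. all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_score(checkpoints, train_points, test_point, lr=0.1):
    """TracIn-style influence: sum over checkpoints of lr * <grad_train, grad_test>,
    averaged over a subsample of training points. Used here as an abnormality
    proxy for `test_point` (low average influence suggests an anomaly)."""
    unsup_loss = lambda m, x: m(x).pow(2).mean()   # stand-in unsupervised loss
    score = 0.0
    for model in checkpoints:
        g_test = flat_grad(model, unsup_loss(model, test_point))
        g_train = torch.stack([flat_grad(model, unsup_loss(model, x.unsqueeze(0)))
                               for x in train_points]).mean(0)
        score += lr * torch.dot(g_train, g_test).item()
    return score

model = nn.Linear(8, 1)                            # toy "checkpoint"
print(tracin_score([model], torch.randn(16, 8), torch.randn(1, 8)))
```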
    Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting. (arXiv:2006.07507v3 [cs.LG] UPDATED)
    Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models. The new update is derived through the solution of an Ordinary Differential Equation (ODE) and solved in a closed form. We show empirically that this new parameter-free algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune.
    ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. (arXiv:2202.11271v2 [cs.RO] UPDATED)
    Robotic navigation has been approached as a problem of 3D reconstruction and planning, as well as an end-to-end learning problem. However, long-range navigation requires both planning and reasoning about local traversability, as well as being able to utilize general knowledge about global geography, in the form of a roadmap, GPS, or other side information providing important cues. In this work, we propose an approach that integrates learning and planning, and can utilize side information such as schematic roadmaps, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate. Our method, ViKiNG, incorporates a local traversability model, which looks at the robot's current camera observation and a potential subgoal to infer how easily that subgoal can be reached, as well as a heuristic model, which looks at overhead maps for hints and attempts to evaluate the appropriateness of these subgoals in order to reach the goal. These models are used by a heuristic planner to identify the best waypoint in order to reach the final destination. Our method performs no explicit geometric reconstruction, utilizing only a topological representation of the environment. Despite having never seen trajectories longer than 80 meters in its training dataset, ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away in previously unseen environments, and exhibit complex behaviors such as probing potential paths and backtracking when they are found to be non-viable. ViKiNG is also robust to unreliable maps and GPS, since the low-level controller ultimately makes decisions based on egocentric image observations, using maps only as planning heuristics. For videos of our experiments, please check out our project page https://sites.google.com/view/viking-release.
    Bézier Curve Gaussian Processes. (arXiv:2205.01754v1 [stat.ML])
    Probabilistic models for sequential data are the basis for a variety of applications concerned with processing temporally ordered information. The predominant approach in this domain is given by neural networks, which incorporate either stochastic units or components. This paper proposes a new probabilistic sequence model building on probabilistic Bézier curves. Using Gaussian distributed control points, these parametric curves constitute a special case of Gaussian processes (GPs). Combined with a Mixture Density Network, Bayesian conditional inference can be performed without the need for mean-field variational approximation or Monte Carlo simulation, which is a requirement of common approaches. For assessing this hybrid model's viability, it is applied to an exemplary sequence prediction task, namely pedestrian trajectory prediction, where a generated prediction also serves as a GP prior. Following this, the initial prediction can be refined using the GP framework by calculating different posterior distributions, in order to adapt more towards a given observed trajectory segment.
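    A minimal numeric sketch of the underlying object: a Bézier curve with Gaussian control points, whose pointwise mean and variance follow directly from the Bernstein-polynomial weights. The sketch assumes independent, isotropic control points; dimensions and values are toy.

```python
import numpy as np
from scipy.special import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial b_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

# Gaussian control points in 2D: means and per-point isotropic variances.
mu = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 2.0], [3.0, 0.0]])   # toy values
var = np.array([0.01, 0.05, 0.05, 0.01])
n = len(mu) - 1

def curve_moments(t):
    """Mean and variance of the curve point at parameter t in [0, 1].
    B(t) = sum_i b_{i,n}(t) P_i is Gaussian since the P_i are Gaussian;
    with independent P_i, the weights enter the variance squared."""
    w = np.array([bernstein(n, i, t) for i in range(n + 1)])
    return w @ mu, (w**2) @ var

for t in (0.0, 0.5, 1.0):
    m, v = curve_moments(t)
    print(t, m, v)
```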
    i-Code: An Integrative and Composable Multimodal Learning Framework. (arXiv:2205.01818v1 [cs.LG])
    Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
    Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition. (arXiv:2205.01982v1 [cs.RO])
    Service robots are becoming increasingly integrated into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem. In particular, we form ensemble methods based on deep representations and handcrafted 3D shape descriptors. To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly. The proposed model is suitable for open-ended learning scenarios where the number of 3D object categories is not fixed and can grow over time. We have performed extensive sets of experiments to assess the performance of the proposed approach in offline and open-ended scenarios. For evaluation purposes, in addition to real object datasets, we generate a large synthetic household object dataset consisting of 27000 views of 90 objects. Experimental results demonstrate the effectiveness of the proposed method on 3D object recognition tasks, as well as its superior performance over the state-of-the-art approaches. Additionally, we demonstrate the effectiveness of our approach in both simulated and real-robot settings, where the robot rapidly learned new categories from limited examples.
    Splicing Detection and Localization In Satellite Imagery Using Conditional GANs. (arXiv:2205.01805v1 [cs.CV])
    The widespread availability of image editing tools and improvements in image processing techniques make image manipulation very easy. Oftentimes, easy-to-use yet sophisticated image manipulation tools yield distortions/changes imperceptible to the human observer. The distribution of forged images can have drastic ramifications, especially when coupled with the speed and vastness of the Internet. Therefore, verifying image integrity poses an immense and important challenge to the digital forensic community. Satellite images specifically can be modified in a number of ways, including the insertion of objects to hide existing scenes and structures. In this paper, we describe the use of a Conditional Generative Adversarial Network (cGAN) to identify the presence of such spliced forgeries within satellite images. Additionally, we identify their locations and shapes. Trained on pristine and falsified images, our method achieves high success on these detection and localization objectives.
    Adversarial Training for High-Stakes Reliability. (arXiv:2205.01663v2 [cs.LG] UPDATED)
    In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a language generation task as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques -- including a tool that assists human adversaries -- to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our simple "avoid injuries" task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. With our chosen thresholds, filtering with our baseline classifier decreases the rate of unsafe completions from about 2.4% to 0.003% on in-distribution data, which is near the limit of our ability to measure. We found that adversarial training significantly increased robustness to the adversarial attacks that we trained on, without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.
    Generalized Reference Kernel for One-class Classification. (arXiv:2205.00534v2 [cs.LG] UPDATED)
    In this paper, we formulate a new generalized reference kernel hoping to improve the original base kernel using a set of reference vectors. Depending on the selected reference vectors, our formulation shows similarities to approximate kernels, random mappings, and Non-linear Projection Trick. Focusing on small-scale one-class classification, our analysis and experimental results show that the new formulation provides approaches to regularize, adjust the rank, and incorporate additional information into the kernel itself, leading to improved one-class classification accuracy.
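    The abstract does not give the construction, but its stated kinship with approximate kernels suggests something in the spirit of a Nyström-style feature map built from reference vectors; the sketch below shows only that classical construction, as a hypothetical illustration of how reference vectors control the rank of a kernel.

```python
import numpy as np
from scipy.linalg import sqrtm, pinv

def rbf(X, Y, gamma=0.5):
    """Gaussian RBF base kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # one-class training data (toy)
R = X[:5]                          # reference vectors (here: a data subset)

# Classical Nystrom-style feature map from references:
# phi(x) = K_RR^{-1/2} k(R, x), so phi(x).T phi(y) approximates k(x, y);
# the rank of the resulting kernel equals the number of references.
K_RR_inv_sqrt = np.real(pinv(sqrtm(rbf(R, R))))
Phi = rbf(X, R) @ K_RR_inv_sqrt

K_approx = Phi @ Phi.T
print(np.abs(K_approx - rbf(X, X)).max())
```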
    Brainish: Formalizing A Multimodal Language for Intelligence and Consciousness. (arXiv:2205.00001v2 [cs.AI] UPDATED)
    Having a rich multimodal inner language is an important component of human intelligence that enables several necessary core cognitive functions such as multimodal prediction, translation, and generation. Building upon the Conscious Turing Machine (CTM), a machine model for consciousness proposed by Blum and Blum (2021), we describe the desiderata of a multimodal language called Brainish, comprising words, images, audio, and sensations combined in representations that the CTM's processors use to communicate with each other. We define the syntax and semantics of Brainish before operationalizing this language through the lens of multimodal artificial intelligence, a vibrant research area studying the computational tools necessary for processing and relating information from heterogeneous signals. Our general framework for learning Brainish involves designing (1) unimodal encoders to segment and represent unimodal data, (2) a coordinated representation space that relates and composes unimodal features to derive holistic meaning across multimodal inputs, and (3) decoders to map multimodal representations into predictions (for fusion) or raw data (for translation or generation). Through discussing how Brainish is crucial for communication and coordination in order to achieve consciousness in the CTM, and by implementing a simple version of Brainish and evaluating its capability of demonstrating intelligence on multimodal prediction and retrieval tasks on several real-world image, text, and audio datasets, we argue that such an inner language will be important for advances in machine models of intelligence and consciousness.
    Sparse Representations of Positive Functions via First and Second-Order Pseudo-Mirror Descent. (arXiv:2011.07142v4 [stat.ML] UPDATED)
    We consider expected risk minimization problems when the range of the estimator is required to be nonnegative, motivated by the settings of maximum likelihood estimation (MLE) and trajectory optimization. To facilitate nonlinear interpolation, we hypothesize that the search space is a Reproducing Kernel Hilbert Space (RKHS). We develop first and second-order variants of stochastic mirror descent employing (i) \emph{pseudo-gradients} and (ii) complexity-reducing projections. Compressive projection in the first-order scheme is executed via kernel orthogonal matching pursuit (KOMP), which overcomes the fact that the vanilla RKHS parameterization grows unbounded with the iteration index in the stochastic setting. Moreover, pseudo-gradients are needed when gradient estimates for cost are only computable up to some numerical error, which arise in, e.g., integral approximations. Under constant step-size and compression budget, we establish tradeoffs between the radius of convergence of the expected sub-optimality and the projection budget parameter, as well as non-asymptotic bounds on the model complexity. To refine the solution's precision, we develop a second-order extension which employs recursively averaged pseudo-gradient outer-products to approximate the Hessian inverse, whose convergence in mean is established under an additional eigenvalue decay condition on the Hessian of the optimal RKHS element, which is unique to this work. Experiments demonstrate favorable performance on inhomogeneous Poisson Process intensity estimation in practice.
    Wavelet neural operator: a neural operator for parametric partial differential equations. (arXiv:2205.02191v1 [physics.comp-ph])
    With massive advancements in sensor technologies and the Internet-of-Things, we now have access to terabytes of historical data; however, there is a lack of clarity on how best to exploit the data to predict future events. One possible alternative in this context is to utilize operator learning algorithms that directly learn nonlinear mappings between two function spaces; this facilitates real-time prediction of naturally arising complex evolutionary dynamics. In this work, we introduce a novel operator learning algorithm referred to as the Wavelet Neural Operator (WNO) that blends an integral kernel with wavelet transforms. WNO harnesses the superiority of wavelets in the time-frequency localization of functions and enables accurate tracking of patterns in the spatial domain and effective learning of the functional mappings. Since wavelets are localized in both time/space and frequency, WNO can provide high spatial and frequency resolution. This allows learning of the finer details of the parametric dependencies in the solution of complex problems. The efficacy and robustness of the proposed WNO are illustrated on a wide array of problems involving the Burgers equation, Darcy flow, the Navier-Stokes equation, the Allen-Cahn equation, and the wave advection equation. A comparative study with respect to existing operator learning frameworks is presented. Finally, the proposed approach is used to build a digital twin capable of predicting Earth's air temperature based on available historical data.
    SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training. (arXiv:2205.01853v1 [cs.DC])
    In today's production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of various tasks that have dynamic resource demands. Serverless computing is an emerging cloud paradigm that provides transparent resource management and scaling for users and has the potential to revolutionize the routine of ML design and training. However, hosting modern ML workflows on existing serverless platforms has non-trivial challenges due to their intrinsic design limitations such as stateless nature, limited communication support across function instances, and limited function execution duration. These limitations result in a lack of an overarching view and adaptation mechanism for training dynamics and an amplification of existing problems in ML workflows. To address the above challenges, we propose SMLT, an automated, scalable, and adaptive serverless framework to enable efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling for ML tasks during training. SMLT further enables user-centric ML workflow execution by supporting user-specified training deadlines and budget limits. In addition, by providing an end-to-end design, SMLT solves the intrinsic problems in serverless platforms such as the communication overhead, limited function execution duration, need for repeated initialization, and also provides explicit fault tolerance for ML training. SMLT is open-sourced and compatible with all major ML frameworks. Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms the state-of-the-art VM-based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X).
    Stochastic Coded Federated Learning with Convergence and Privacy Guarantees. (arXiv:2201.10092v4 [cs.LG] UPDATED)
    Federated learning (FL) has attracted much attention as a privacy-preserving distributed machine learning framework, where many clients collaboratively train a machine learning model by exchanging model updates with a parameter server instead of sharing their raw data. Nevertheless, FL training suffers from slow convergence and unstable performance due to stragglers caused by the heterogeneous computational resources of clients and fluctuating communication rates. This paper proposes a coded FL framework to mitigate the straggler issue, namely stochastic coded federated learning (SCFL). In this framework, each client generates a privacy-preserving coded dataset by adding additive noise to the random linear combination of its local data. The server collects the coded datasets from all the clients to construct a composite dataset, which helps to compensate for the straggling effect. In the training process, the server as well as clients perform mini-batch stochastic gradient descent (SGD), and the server adds a make-up term in model aggregation to obtain unbiased gradient estimates. We characterize the privacy guarantee by the mutual information differential privacy (MI-DP) and analyze the convergence performance in federated learning. Besides, we demonstrate a privacy-performance tradeoff of the proposed SCFL method by analyzing the influence of the privacy constraint on the convergence rate. Finally, numerical experiments corroborate our analysis and show the benefits of SCFL in achieving fast convergence while preserving data privacy.
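    A small sketch of the client-side coding step as described above: each coded sample is a random linear combination of the client's raw samples plus additive noise. Shapes, the mixing distribution, and the noise scale are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_coded_dataset(X_local, y_local, n_coded=32, noise_std=0.5):
    """Client-side coding in the SCFL spirit: random linear combinations of the
    local data, perturbed with additive Gaussian noise for privacy. The server
    aggregates coded sets from all clients into a composite dataset."""
    G = rng.standard_normal((n_coded, X_local.shape[0]))   # random mixing matrix
    X_coded = G @ X_local + noise_std * rng.standard_normal((n_coded, X_local.shape[1]))
    y_coded = G @ y_local + noise_std * rng.standard_normal(n_coded)
    return X_coded, y_coded

X, y = rng.standard_normal((100, 10)), rng.standard_normal(100)
Xc, yc = make_coded_dataset(X, y)
print(Xc.shape, yc.shape)   # (32, 10) (32,)
```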
    Explain to Not Forget: Defending Against Catastrophic Forgetting with XAI. (arXiv:2205.01929v1 [cs.LG])
    The ability to continuously process and retain new information, as we do naturally as humans, is a feat that is highly sought after when training neural networks. Unfortunately, traditional optimization algorithms often require large amounts of data to be available during training time, and updates with respect to new data are difficult after the training process has been completed. In fact, when new data or tasks arise, previous progress may be lost, as neural networks are prone to catastrophic forgetting. Catastrophic forgetting describes the phenomenon whereby a neural network completely forgets previous knowledge when given new information. We propose a novel training algorithm called training by explaining, in which we leverage Layer-wise Relevance Propagation to retain the information a neural network has already learned in previous tasks when training on new data. The method is evaluated on a range of benchmark datasets as well as more complex data. Our method not only successfully retains the knowledge of old tasks within the neural networks but does so more resource-efficiently than other state-of-the-art solutions.
    Exploring Rawlsian Fairness for K-Means Clustering. (arXiv:2205.02052v1 [cs.LG])
    We conduct an exploratory study that looks at incorporating John Rawls' ideas on fairness into existing unsupervised machine learning algorithms. Our focus is on the task of clustering, specifically the k-means clustering algorithm. To the best of our knowledge, this is the first work that uses Rawlsian ideas in clustering. Towards this, we attempt to develop a postprocessing technique, i.e., one that operates on the cluster assignment generated by the standard k-means clustering algorithm. Our technique perturbs this assignment over a number of iterations to make it fairer according to Rawls' difference principle while minimally affecting the overall utility. As the first step, we consider two simple perturbation operators -- $\mathbf{R_1}$ and $\mathbf{R_2}$ -- that reassign examples in a given cluster assignment to new clusters; $\mathbf{R_1}$ assigning a single example to a new cluster, and $\mathbf{R_2}$ a pair of examples to new clusters. Our experiments on a sample of the Adult dataset demonstrate that both operators make meaningful perturbations in the cluster assignment towards incorporating Rawls' difference principle, with $\mathbf{R_2}$ being more efficient than $\mathbf{R_1}$ in terms of the number of iterations. However, we observe that there is still a need to design operators that make significantly better perturbations. Nevertheless, both operators provide good baselines for designing and comparing any future operator, and we hope our findings would aid future work in this direction.
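    A sketch of what an $\mathbf{R_1}$-style step could look like: reassign the single example whose move most raises the utility of the worst-off protected group. The utility definition here (negative mean distance to the assigned centroid, per group) and the greedy search are hypothetical stand-ins; the paper's operator may differ, e.g. in how it trades off overall utility.

```python
import numpy as np

def group_utilities(X, labels, centers, groups):
    """Per-protected-group utility: negative mean distance to assigned centroid."""
    d = np.linalg.norm(X - centers[labels], axis=1)
    return {g: -d[groups == g].mean() for g in np.unique(groups)}

def r1_step(X, labels, centers, groups):
    """One R1-style perturbation: move the single example whose reassignment
    most raises the worst-off group's utility (Rawls' difference principle)."""
    best = (min(group_utilities(X, labels, centers, groups).values()), None)
    for i in range(len(X)):
        for k in range(len(centers)):
            if k == labels[i]:
                continue
            trial = labels.copy()
            trial[i] = k
            worst = min(group_utilities(X, trial, centers, groups).values())
            if worst > best[0]:
                best = (worst, (i, k))
    if best[1] is not None:
        i, k = best[1]
        labels[i] = k
    return labels

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
groups = rng.integers(2, size=40)               # protected attribute
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])   # from a prior k-means run
labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
labels = r1_step(X, labels, centers, groups)
```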
    Depth Uncertainty Networks for Active Learning. (arXiv:2112.06796v2 [cs.LG] UPDATED)
    In active learning, the size and complexity of the training dataset changes over time. Simple models that are well specified by the amount of data available at the start of active learning might suffer from bias as more points are actively sampled. Flexible models that might be well suited to the full dataset can suffer from overfitting towards the start of active learning. We tackle this problem using Depth Uncertainty Networks (DUNs), a BNN variant in which the depth of the network, and thus its complexity, is inferred. We find that DUNs outperform other BNN variants on several active learning tasks. Importantly, we show that on the tasks in which DUNs perform best they present notably less overfitting than baselines.
    Understanding CNNs from excitations. (arXiv:2205.00932v2 [cs.CV] UPDATED)
    For instance-level explanation, in order to reveal the relations between high-level semantics and detailed spatial information, this paper proposes a novel cognitive approach to neural networks, named PANE. Under the guidance of PANE, a novel saliency map representation method, named IOM, is proposed for CNN-like models. We compare IOM with eight state-of-the-art saliency map representation methods. The experimental results show that IOM far outperforms the baselines. This work may bring a new perspective to understanding deep neural networks.
    Diverse Image Captioning with Grounded Style. (arXiv:2205.01813v1 [cs.CV])
    Stylized image captioning, as presented in prior work, aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiments. Such prior work relies on given sentiment identifiers, which are used to express a certain global style in the caption, e.g. positive or negative, but without taking into account the stylistic content of the visual scene. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the stylized information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions with diversity in styles that are grounded in the image.
    Regret-Optimal Filtering for Prediction and Estimation. (arXiv:2101.10357v3 [math.OC] UPDATED)
    The filtering problem of causally estimating a desired signal from a related observation signal is investigated through the lens of regret optimization. Classical filter designs, such as $\mathcal H_2$ (Kalman) and $\mathcal H_\infty$, minimize the average and worst-case estimation errors, respectively. As a result $\mathcal H_2$ filters are sensitive to inaccuracies in the underlying statistical model, and $\mathcal H_\infty$ filters are overly conservative since they safeguard against the worst-case scenario. We propose instead to minimize the \emph{regret} in order to design filters that perform well in different noise regimes by comparing their performance with that of a clairvoyant filter. More explicitly, we minimize the largest deviation of the squared estimation error of a causal filter from that of a non-causal filter that has access to future observations. In this sense, the regret-optimal filter will have the best competitive performance with respect to the non-causal benchmark filter no matter what the true signal and the observation process are. For the important case of signals that can be described with a time-invariant state-space, we provide an explicit construction for the regret optimal filter in the estimation (causal) and the prediction (strictly-causal) regimes. These solutions are obtained by reducing the regret filtering problem to a Nehari problem, i.e., approximating a non-causal operator by a causal one in spectral norm. The regret-optimal filters bear some resemblance to Kalman and $H_\infty$ filters: they are expressed as state-space models, inherit the finite dimension of the original state-space, and their solutions require solving algebraic Riccati equations. Numerical simulations demonstrate that regret minimization inherently interpolates between the performances of the $H_2$ and $H_\infty$ filters and is thus a viable approach for filter design.
    Making SGD Parameter-Free. (arXiv:2205.02160v1 [math.OC])
    We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.
    Ensembling Off-the-shelf Models for GAN Training. (arXiv:2112.09130v3 [cs.CV] UPDATED)
    The advent of large-scale training has produced a cornucopia of powerful visual recognition models. However, generative models, such as GANs, have traditionally been trained from scratch in an unsupervised manner. Can the collective "knowledge" from a large bank of pretrained vision models be leveraged to improve GAN training? If so, with so many models to choose from, which one(s) should be selected, and in what manner are they most effective? We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators. Notably, the particular subset of selected models greatly affects performance. We propose an effective selection mechanism, by probing the linear separability between real and fake samples in pretrained model embeddings, choosing the most accurate model, and progressively adding it to the discriminator ensemble. Interestingly, our method can improve GAN training in both limited data and large-scale settings. Given only 10k training samples, our FID on LSUN Cat matches the StyleGAN2 trained on 1.6M images. On the full dataset, our method improves FID by 1.5x to 2x on cat, church, and horse categories of LSUN.
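    A sketch of the probing criterion described above, assuming generic feature extractors and scikit-learn; the scoring is shown for a single candidate model, and the embeddings here are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability(real_feats, fake_feats):
    """Score a pretrained feature space by how linearly separable real and
    fake samples are in it; higher probe accuracy suggests a more useful
    discriminator signal."""
    X = np.vstack([real_feats, fake_feats])
    y = np.r_[np.ones(len(real_feats)), np.zeros(len(fake_feats))]
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, y, cv=3).mean()

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 128)) + 0.5    # stand-in embeddings of real images
fake = rng.standard_normal((200, 128))          # stand-in embeddings of GAN samples
print(separability(real, fake))   # rank candidate models by this score, then
                                  # progressively add the best to the ensemble
```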
    AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First -- Using Relation Extraction to Identify Entities. (arXiv:2203.05325v2 [cs.CL] UPDATED)
    In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction. This means that the system can be trained even on data sets where only a subset of all valid entity spans is annotated. We provide an extensive evaluation of the proposed system and its strengths and weaknesses. Our approach, which can be scaled dynamically in computational complexity at inference time, produces predictions with high precision and reaches 3rd place in the leaderboard of SemEval-2022 Task 12. For inputs in the domain of physics and math, it achieves high relation extraction macro F1 scores of 95.43% and 79.17%, respectively. The code used for training and evaluating our models is available at: https://github.com/nicpopovic/RE1st
    Inverting brain grey matter models with likelihood-free inference: a tool for trustable cytoarchitecture measurements. (arXiv:2111.08693v2 [q-bio.QM] UPDATED)
    Effective characterisation of the brain grey matter cytoarchitecture with quantitative sensitivity to soma density and volume remains an unsolved challenge in diffusion MRI (dMRI). Solving the problem of relating the dMRI signal with cytoarchitectural characteristics calls for the definition of a mathematical model that describes brain tissue via a handful of physiologically-relevant parameters and an algorithm for inverting the model. To address this issue, we propose a new forward model, specifically a new system of equations, requiring a few relatively sparse b-shells. We then apply modern tools from Bayesian analysis known as likelihood-free inference (LFI) to invert our proposed model. As opposed to other approaches from the literature, our algorithm yields not only an estimation of the parameter vector $\theta$ that best describes a given observed data point $x_0$, but also a full posterior distribution $p(\theta|x_0)$ over the parameter space. This enables a richer description of the model inversion, providing indicators such as credible intervals for the estimated parameters and a complete characterization of the parameter regions where the model may present indeterminacies. We approximate the posterior distribution using deep neural density estimators, known as normalizing flows, and fit them using a set of repeated simulations from the forward model. We validate our approach on simulations using dmipy and then apply the whole pipeline on two publicly available datasets.
    The leap to ordinal: detailed functional prognosis after traumatic brain injury with a flexible modelling approach. (arXiv:2202.04801v2 [cs.LG] UPDATED)
    When a patient is admitted to the intensive care unit (ICU) after a traumatic brain injury (TBI), an early prognosis is essential for baseline risk adjustment and shared decision making. TBI outcomes are commonly categorised by the Glasgow Outcome Scale-Extended (GOSE) into 8, ordered levels of functional recovery at 6 months after injury. Existing ICU prognostic models predict binary outcomes at a certain threshold of GOSE (e.g., prediction of survival [GOSE>1] or functional independence [GOSE>4]). We aimed to develop ordinal prediction models that concurrently predict probabilities of each GOSE score. From a prospective cohort (n=1,550, 65 centres) in the ICU stratum of the Collaborative European NeuroTrauma Effectiveness Research in TBI (CENTER-TBI) patient dataset, we extracted all clinical information within 24 hours of ICU admission (1,151 predictors) and 6-month GOSE scores. We analysed the effect of 2 design elements on ordinal model performance: (1) the baseline predictor set, ranging from a concise set of 10 validated predictors to a token-embedded representation of all possible predictors, and (2) the modelling strategy, from ordinal logistic regression to multinomial deep learning. With repeated k-fold cross-validation, we found that expanding the baseline predictor set significantly improved ordinal prediction performance while increasing analytical complexity did not. Half of these gains could be achieved with the addition of 8 high-impact predictors (2 demographic variables, 4 protein biomarkers, and 2 severity assessments) to the concise set. At best, ordinal models achieved 0.76 (95% CI: 0.74-0.77) ordinal discrimination ability (ordinal c-index) and 57% (95% CI: 54%-60%) explanation of ordinal variation in 6-month GOSE (Somers' D). Our results motivate the search for informative predictors for higher GOSE and the development of ordinal dynamic prediction models.
    On Deep Neural Network Calibration by Regularization and its Impact on Refinement. (arXiv:2106.09385v3 [cs.LG] UPDATED)
    Deep neural networks have been shown to be highly miscalibrated; often they tend to be overconfident in their predictions. This poses a significant challenge to the reliable use of deep neural networks (DNNs) in safety-critical systems. Many recently proposed approaches to mitigate this have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates the separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition reviewing the refinement of a calibrated model. Firstly, we show the breakdown of the expected calibration error (ECE) into predicted confidence and refinement under the assumption of over-confident predictions. Secondly, linking with this result, we highlight that regularization-based calibration only focuses on naively reducing a model's confidence. This logically has a severe downside for a model's refinement, as correct and incorrect predictions become tightly coupled. Lastly, connecting refinement with ECE also provides support to existing refinement-based approaches which improve calibration but do not explain the reasoning behind it. We support our claims through rigorous empirical evaluations of many state-of-the-art calibration approaches on widely used datasets and neural networks. We find that many calibration approaches, such as label smoothing and mixup, lower the usefulness of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods.
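    For reference, the ECE discussed above is the bin-weighted gap between accuracy and confidence; a standard computation, with a toy binning scheme and synthetic overconfident predictions:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average over confidence bins of |accuracy - confidence|."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in the bin
            conf = confidences[mask].mean()  # mean predicted confidence
            ece += mask.mean() * abs(acc - conf)
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)          # predicted confidences
correct = rng.uniform(size=1000) < conf * 0.9    # overconfident: accuracy < confidence
print(expected_calibration_error(conf, correct))
```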
    Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models. (arXiv:2110.04478v2 [cs.DC] UPDATED)
    Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72X (2.70X max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.
    PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++. (arXiv:2201.02863v6 [cs.LG] UPDATED)
    Standard deep learning algorithms are implemented using floating-point real numbers. This presents an obstacle for implementing them on low-end devices which may not have dedicated floating-point units (FPUs). As a result, researchers in tinyML have considered machine learning algorithms that can train and run a deep neural network (DNN) on a low-end device using integer operations only. In this paper we propose PocketNN, a light and self-contained proof-of-concept framework in pure C++ for the training and inference of DNNs using only integers. Unlike other approaches, PocketNN directly operates on integers without requiring any explicit quantization algorithms or customized fixed-point formats. This was made possible by pocket activations, which are a family of activation functions devised for integer-only DNNs, and an emerging DNN training algorithm called direct feedback alignment (DFA). Unlike the standard backpropagation (BP), DFA trains each layer independently, thus avoiding integer overflow which is a key problem when using BP with integer-only operations. We used PocketNN to train some DNNs on two well-known datasets, MNIST and Fashion-MNIST. Our experiments show that the DNNs trained with our PocketNN achieved 96.98% and 87.7% accuracies on MNIST and Fashion-MNIST datasets, respectively. The accuracies are very close to the equivalent DNNs trained using BP with floating-point real number operations, such that accuracy degradations were just 1.02%p and 2.09%p, respectively. Finally, our PocketNN has high compatibility and portability for low-end devices as it is open source and implemented in pure C++ without any dependencies.
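    A float-valued sketch of the DFA update that PocketNN builds on (the integer-only pocket activations are the paper's contribution and are not reproduced here): each hidden layer receives the output error through its own fixed random matrix instead of a backpropagated gradient, so layers train independently. Shapes and learning rate are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy 2-layer network and a fixed random feedback matrix B for the hidden layer.
W1 = rng.standard_normal((784, 128)) * 0.01
W2 = rng.standard_normal((128, 10)) * 0.01
B = rng.standard_normal((10, 128))          # fixed, never trained
x = rng.standard_normal((32, 784))          # toy batch
y = np.eye(10)[rng.integers(10, size=32)]   # one-hot labels
lr = 0.01

h = relu(x @ W1)                            # forward pass
y_hat = h @ W2
e = y_hat - y                               # output error

# DFA: project the output error directly to the hidden layer through B,
# instead of through W2.T as backpropagation would.
dh = (e @ B) * (h > 0)                      # ReLU derivative gates the feedback
W2 -= lr * h.T @ e / len(x)
W1 -= lr * x.T @ dh / len(x)
```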
    Negative Sampling in Variational Autoencoders. (arXiv:1910.02760v3 [cs.LG] UPDATED)
    Modern deep artificial neural networks have achieved great success in the domain of computer vision and beyond. However, their application to many real-world tasks is undermined by certain limitations, such as overconfident uncertainty estimates on out-of-distribution data or performance deterioration under data distribution shifts. Several types of deep learning models used for density estimation through probabilistic generative modeling have been shown to fail to detect out-of-distribution samples by assigning higher likelihoods to anomalous data. We investigate this failure mode in Variational Autoencoder models, which are also prone to this, and improve upon the out-of-distribution generalization performance of the model by employing an alternative training scheme utilizing negative samples. We present a fully unsupervised version: when the model is trained in an adversarial manner, the generator's own outputs can be used as negative samples. We demonstrate empirically the effectiveness of the approach in reducing the overconfident likelihood estimates of out-of-distribution inputs on image data.
    Evaluating Transferability for Covid 3D Localization Using CT SARS-CoV-2 segmentation models. (arXiv:2205.02152v1 [eess.IV])
    Recent studies indicate that detecting radiographic patterns on CT scans can yield high sensitivity and specificity for COVID-19 localization. In this paper, we investigate the transferability of deep learning models for semantic segmentation of pneumonia-infected areas in CT images. Transfer learning allows for fast initialization/reutilization of detection models when large volumes of training data are not available. Our work explores the efficacy of using pre-trained U-Net architectures, trained on a specific CT data set, for identifying COVID-19 side-effects in images from different datasets. Experimental results indicate improvement in the segmentation accuracy of identifying COVID-19 infected regions.
    Multistage linguistic conditioning of convolutional layers for speech emotion recognition. (arXiv:2110.06650v2 [cs.LG] UPDATED)
    In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
    Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language. (arXiv:2204.04487v2 [cs.CL] UPDATED)
    Spurious correlations are a threat to the trustworthiness of natural language processing systems, motivating research into methods for identifying and eliminating them. However, addressing the problem of spurious correlations requires more clarity on what they are and how they arise in language data. Gardner et al (2021) argue that the compositional nature of language implies that \emph{all} correlations between labels and individual "input features" are spurious. This paper analyzes this proposal in the context of a toy example, demonstrating three distinct conditions that can give rise to feature-label correlations in a simple PCFG. Linking the toy example to a structured causal model shows that (1) feature-label correlations can arise even when the label is invariant to interventions on the feature, and (2) feature-label correlations may be absent even when the label is sensitive to interventions on the feature. Because input features will be individually correlated with labels in all but very rare circumstances, domain knowledge must be applied to identify spurious correlations that pose genuine robustness threats.
    Zero-Shot Text-Guided Object Generation with Dream Fields. (arXiv:2112.01455v2 [cs.CV] UPDATED)
    We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods only generate objects from a handful of categories, such as ShapeNet. Instead, we guide generation with image-text models pre-trained on large datasets of captioned images from the web. Our method optimizes a Neural Radiance Field from many camera views so that rendered images score highly with a target caption according to a pre-trained CLIP model. To improve fidelity and visual quality, we introduce simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures. In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.
    An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks. (arXiv:2203.16773v2 [eess.AS] UPDATED)
    Speech representations learned from Self-supervised learning (SSL) models can benefit various speech processing tasks. However, utilizing SSL representations usually requires fine-tuning the pre-trained models or designing task-specific downstream models and loss functions, causing much memory usage and human labor. Recently, prompting in Natural Language Processing (NLP) has been found to be an efficient technique to leverage pre-trained language models (LMs). Specifically, prompt tuning optimizes a limited number of task-specific parameters with a fixed pre-trained model; as a result, only a small set of parameters is needed to be stored for each task. Prompt tuning improves computation and memory efficiency by leveraging the pre-trained LM's prediction ability. Nevertheless, such a paradigm is little studied in the speech community. We report in this paper the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM). Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models. We further study the technique in challenging sequence generation tasks. Prompt tuning also demonstrates its potential, while the limitation and possible research directions are discussed in this paper. The source code is available on https://github.com/ga642381/SpeechPrompt.
    Local versions of sum-of-norms clustering. (arXiv:2109.09589v2 [cs.LG] UPDATED)
    Sum-of-norms clustering is a convex optimization problem whose solution can be used for the clustering of multivariate data. We propose and study a localized version of this method, and show in particular that it can separate arbitrarily close balls in the stochastic ball model. More precisely, we prove a quantitative bound on the error incurred in the clustering of disjoint connected sets. Our bound is expressed in terms of the number of datapoints and the localization length of the functional.
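    For context, the (global) sum-of-norms clustering objective that this work localizes is commonly written as the convex program

    $$\min_{u_1,\ldots,u_n} \; \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|^2 + \lambda \sum_{i<j} \|u_i - u_j\|,$$

    where points $i$ and $j$ are placed in the same cluster whenever the minimizer satisfies $u_i = u_j$.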
    FederatedScope: A Flexible Federated Learning Platform for Heterogeneity. (arXiv:2204.05011v3 [cs.LG] UPDATED)
    Although remarkable progress has been made by the existing federated learning (FL) platforms to provide fundamental functionalities for development, these platforms cannot adequately address the challenges brought by the heterogeneity of FL scenarios from both academia and industry. To fill this gap, in this paper, we propose a flexible federated learning platform, named FederatedScope, for handling various types of heterogeneity in FL. Considering both flexibility and extensibility, FederatedScope adopts an event-driven architecture to frame an FL course into event-handler pairs: the behaviors of participants are described in handlers, and triggered by events of message passing or meeting certain conditions in training. For a new FL application, developers only need to specify the adopted FL algorithm by defining new types of events and the corresponding handling functions based on participants' behaviors, which would be automatically executed in an asynchronous way for balancing effectiveness and efficiency in FederatedScope. Meanwhile, towards an easy-to-use platform, FederatedScope provides rich built-in algorithms, including personalization, federated aggregation, privacy protection, and privacy attack, for users to conveniently customize participant-specific training, fusion, aggregation, and protection. Besides, a federated hyperparameter optimization module is integrated into FederatedScope for users to automatically tune their FL systems to resolve the instability issues brought by heterogeneity. We conduct a series of experiments on the provided easy-to-use and comprehensive FL benchmarks to validate the correctness and efficiency of FederatedScope. We have released FederatedScope for users on https://github.com/alibaba/FederatedScope to promote research and industrial deployment of federated learning in a variety of real-world applications.
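    The event-handler pattern described above can be illustrated with a toy dispatcher. This is an illustration of the pattern only, not FederatedScope's actual API; the event name 'model_para' is hypothetical.

    ```python
    class Participant:
        """Toy event-driven participant: behaviours are handlers keyed by event."""
        def __init__(self):
            self.handlers = {}

        def register(self, event, handler):
            self.handlers[event] = handler

        def on_message(self, event, payload):
            # Incoming messages trigger the handler registered for that event.
            return self.handlers[event](payload)

    def aggregate(updates):
        return sum(updates) / len(updates)   # toy FedAvg over scalar updates

    server = Participant()
    server.register("model_para", aggregate)
    print(server.on_message("model_para", [1.0, 2.0, 3.0]))  # -> 2.0
    ```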
    Efficient Accelerator for Dilated and Transposed Convolution with Decomposition. (arXiv:2205.02103v1 [cs.AR])
    Hardware acceleration for dilated and transposed convolution enables real-time execution of related tasks like segmentation, but current designs are specific to these convolution types or suffer from the complex control required for reconfigurable designs. This paper presents a design that decomposes the input or the weights for dilated and transposed convolutions, respectively, to skip redundant computations and thus executes efficiently on existing dense CNN hardware as well. The proposed architecture cuts the cycle count by 87.8\% to achieve an 8.2X speedup over a naive execution for the ENet case.
    Music-to-Dance Generation with Optimal Transport. (arXiv:2112.01806v2 [cs.SD] UPDATED)
    Dance choreography for a piece of music is a challenging task, requiring creativity in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling and generative adversarial networks, but their generated dance sequences often fall short in motion realism, diversity and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographies from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well-defined and non-divergent training objective that mitigates the limitations of standard GAN training, which is frequently plagued by instability and divergent generator loss. Extensive experiments demonstrate that our MDOT-Net can synthesize realistic and diverse dances which achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation. Sample results are found at https://www.youtube.com/watch?v=dErfBkrlUO8.
    UserIdentifier: Implicit User Representations for Simple and Effective Personalized Sentiment Analysis. (arXiv:2110.00135v2 [cs.LG] UPDATED)
    Global models are trained to be as generalizable as possible, with user invariance considered desirable since the models are shared across multitudes of users. As such, these models are often unable to produce personalized responses for individual users, based on their data. Contrary to widely-used personalization techniques based on few-shot learning, we propose UserIdentifier, a novel scheme for training a single shared model for all users. Our approach produces personalized responses by adding fixed, non-trainable user identifiers to the input data. We empirically demonstrate that this proposed method outperforms the prefix-tuning based state-of-the-art approach by up to 13%, on a suite of sentiment analysis datasets. We also show that, unlike prior work, this method needs neither any additional model parameters nor any extra rounds of few-shot fine-tuning.
    Differentiable Time-Frequency Scattering in Kymatio. (arXiv:2204.08269v3 [cs.SD] UPDATED)
    Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue down to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time-frequency scattering in Kymatio, an open-source Python package for scattering transforms. Unlike prior implementations, Kymatio accommodates NumPy and PyTorch as backends and is thus portable on both CPU and GPU. We demonstrate the usefulness of JTFS in Kymatio via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.
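    As a usage sketch, Kymatio's PyTorch backend makes scattering coefficients differentiable end-to-end; shown here with the 1-D scattering transform, whose calling pattern the joint time-frequency frontend broadly follows.

    ```python
    import torch
    from kymatio.torch import Scattering1D

    T = 2 ** 13
    scattering = Scattering1D(J=6, shape=T, Q=8)  # octaves J, wavelets per octave Q
    x = torch.randn(1, T, requires_grad=True)     # toy audio signal
    Sx = scattering(x)                            # scattering coefficients
    Sx.sum().backward()                           # gradients flow back to the input
    ```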
    CausalNLP: A Practical Toolkit for Causal Inference with Text. (arXiv:2106.08043v4 [cs.CL] UPDATED)
    Causal inference is the process of estimating the effect or impact of a treatment on an outcome with other covariates as potential confounders (and mediators) that may need to be controlled. The vast majority of existing methods and systems for causal inference assume that all variables under consideration are categorical or numerical (e.g., gender, price, enrollment). In this paper, we present CausalNLP, a toolkit for inferring causality with observational data that includes text in addition to traditional numerical and categorical variables. CausalNLP employs meta-learners for treatment effect estimation and supports using raw text and its linguistic properties as a treatment, an outcome, or a "controlled-for" variable (e.g., confounder). The library is open source and available at: https://github.com/amaiya/causalnlp.
    Visual Similarity Attention. (arXiv:1911.07381v2 [cs.CV] UPDATED)
    While there has been substantial progress in learning suitable distance metrics, these techniques in general lack transparency and decision reasoning, i.e., explaining why the input set of images is similar or dissimilar. In this work, we solve this key problem by proposing the first method to generate generic visual similarity explanations with gradient-based attention. We demonstrate that our technique is agnostic to the specific similarity model type, e.g., we show applicability to Siamese, triplet, and quadruplet models. Furthermore, we make our proposed similarity attention a principled part of the learning process, resulting in a new paradigm for learning similarity functions. We demonstrate that our learning mechanism results in more generalizable, as well as explainable, similarity models. Finally, we demonstrate the generality of our framework by means of experiments on a variety of tasks, including image retrieval, person re-identification, and low-shot semantic segmentation.
    DiCOVA-Net: Diagnosing COVID-19 using Acoustics based on Deep Residual Network for the DiCOVA Challenge 2021. (arXiv:2107.06126v2 [cs.SD] UPDATED)
    In this paper, we propose a deep residual network-based method, namely the DiCOVA-Net, to identify COVID-19 infected patients based on the acoustic recording of their coughs. Since there are far more healthy people than infected patients, this classification problem faces the challenge of imbalanced data. To improve the model's ability to recognize the minority class (the infected patients), we introduce data augmentation and cost-sensitive methods into our model. Besides, considering the particularity of this task, we deploy some fine-tuning techniques to adapt the pre-trained ResNet50. Furthermore, to improve the model's generalizability, we use ensemble learning to integrate prediction results from multiple base classifiers generated using different random seeds. To evaluate the proposed DiCOVA-Net's performance, we conducted experiments with the DiCOVA challenge dataset. The results show that our method achieved an AUC of 85.43\%, placing among the top of all competing teams.
    When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer. (arXiv:2110.14782v3 [cs.CL] UPDATED)
    While recent work on multilingual language models has demonstrated their capacity for cross-lingual zero-shot transfer on downstream tasks, there is a lack of consensus in the community as to what shared properties between languages enable such transfer. Analyses involving pairs of natural languages are often inconclusive and contradictory since languages simultaneously differ in many linguistic aspects. In this paper, we perform a large-scale empirical study to isolate the effects of various linguistic properties by measuring zero-shot transfer between four diverse natural languages and their counterparts constructed by modifying aspects such as the script, word order, and syntax. Among other things, our experiments show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order, and there is a strong correlation between transfer performance and word embedding alignment between languages (e.g., R=0.94 on the task of NLI). Our results call for multilingual models to focus on explicitly improving word embedding alignment between languages rather than relying on its implicit emergence.
    Towards All-around Knowledge Transferring: Learning From Task-irrelevant Labels. (arXiv:2011.08470v2 [cs.LG] UPDATED)
    Deep neural models have achieved strong performance on numerous classification tasks, but require sufficient manually annotated data. Since it is extremely time-consuming and expensive to annotate adequate data for each classification task, learning an empirically effective model that generalizes from a small dataset has received increased attention. Existing efforts mainly focus on transferring task-relevant knowledge from other similar data to tackle the issue. These approaches have yielded remarkable improvements, yet neglect the fact that task-irrelevant features can induce substantial negative transfer effects. To date, no large-scale studies have been performed to investigate the impact of task-irrelevant features, let alone the utilization of this kind of features. In this paper, we first propose Task-Irrelevant Transfer Learning (TIRTL) to exploit task-irrelevant features, which are mainly extracted from task-irrelevant labels. Particularly, we suppress the expression of task-irrelevant information and facilitate the learning process of classification. We also provide a theoretical explanation of our method. In addition, TIRTL does not conflict with methods that exploit task-relevant knowledge and can be readily combined with them, enabling the simultaneous utilization of task-relevant and task-irrelevant features for the first time. In order to verify the effectiveness of our theory and method, we conduct extensive experiments on facial expression recognition and digit recognition tasks. Our source code will also be made available for reproducibility.
    Optimizing One-pixel Black-box Adversarial Attacks. (arXiv:2205.02116v1 [cs.CR])
    The output of Deep Neural Networks (DNN) can be altered by a small perturbation of the input in a black-box setting by making multiple calls to the DNN. However, the high computational cost and time required make the existing approaches impractical. This work seeks to improve the One-pixel (few-pixel) black-box adversarial attacks to reduce the number of calls to the network under attack. The One-pixel attack uses a non-gradient optimization algorithm to find pixel-level perturbations under the constraint of a fixed number of pixels, which causes the network to predict the wrong label for a given image. We show through experimental results how the choice of the optimization algorithm and initial positions to search can reduce function calls and increase attack success significantly, making the attack more practical in real-world settings.
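    A minimal sketch of the one-pixel attack's search, using SciPy's differential evolution as the gradient-free optimizer. `predict_prob` is an assumed black box returning class probabilities for an HxWx3 image; the hyperparameters are illustrative.

    ```python
    import numpy as np
    from scipy.optimize import differential_evolution

    def one_pixel_attack(image, true_label, predict_prob):
        h, w = image.shape[:2]

        def objective(z):                      # z = (x, y, r, g, b)
            perturbed = image.copy()
            perturbed[int(z[1]), int(z[0])] = z[2:5]   # overwrite one pixel
            # Minimizing the true-class probability drives misclassification.
            return predict_prob(perturbed)[true_label]

        bounds = [(0, w - 1), (0, h - 1), (0, 255), (0, 255), (0, 255)]
        result = differential_evolution(objective, bounds,
                                        maxiter=75, popsize=10, seed=0)
        return result.x                        # best (x, y, r, g, b) found
    ```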
    Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning. (arXiv:2205.01992v1 [cs.LG])
    The success of machine learning is fueled by the increasing availability of computing power and large training datasets. The training data is used to learn new models or update existing ones, assuming that it is sufficiently representative of the data that will be encountered at test time. This assumption is challenged by the threat of poisoning, an attack that manipulates the training data to compromise the model's performance at test time. Although poisoning has been acknowledged as a relevant threat in industry applications, and a variety of different attacks and defenses have been proposed so far, a complete systematization and critical review of the field is still missing. In this survey, we provide a comprehensive systematization of poisoning attacks and defenses in machine learning, reviewing more than 200 papers published in the field in the last 15 years. We start by categorizing the current threat models and attacks, and then organize existing defenses accordingly. While we focus mostly on computer-vision applications, we argue that our systematization also encompasses state-of-the-art attacks and defenses for other data modalities. Finally, we discuss existing resources for research in poisoning, and shed light on the current limitations and open research questions in this research field.
    Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification. (arXiv:2102.07711v2 [cs.LG] UPDATED)
    We study bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker can observe both the selected actions and their corresponding rewards and can contaminate the rewards with additive noise. We show that any bandit algorithm with regret $O(\log T)$ can be forced to suffer a regret $\Omega(T)$ with an expected amount of contamination $O(\log T)$. This amount of contamination is also necessary, as we prove that there exists an $O(\log T)$ regret bandit algorithm, specifically the classical UCB, that requires $\Omega(\log T)$ amount of contamination to suffer regret $\Omega(T)$. To combat such attacks, our second main contribution is to propose verification based mechanisms, which use limited verification to access a limited number of uncontaminated rewards. In particular, for the case of unlimited verifications, we show that with $O(\log T)$ expected number of verifications, a simple modified version of the ETC type bandit algorithm can restore the order optimal $O(\log T)$ regret irrespective of the amount of contamination used by the attacker. We also provide a UCB-like verification scheme, called Secure-UCB, that likewise enjoys full recovery from any attack, again with an expected $O(\log T)$ number of verifications. To derive a matching lower bound on the number of verifications, we prove that for any order-optimal bandit algorithm, $\Omega(\log T)$ verifications are necessary to recover the order-optimal regret. On the other hand, when the number of verifications is bounded above by a budget $B$, we propose a novel algorithm, Secure-BARBAR, which provably achieves $O(\min\{C,T/\sqrt{B} \})$ regret with high probability against weak attackers, where $C$ is the total amount of contamination by the attacker; this breaks the known $\Omega(C)$ lower bound of the non-verified setting if $C$ is large.
    Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations. (arXiv:2205.01897v1 [eess.AS])
    Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) albeit using fewer parameters. We show that this approach does not require oversampling and allows the sampling rate to be increased after training has completed, which results in increased accuracy. Using a sophisticated numerical solver allows the accuracy to be increased further at the cost of slower processing. ODEs learned this way do not require closed forms but are still physically interpretable.
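    A minimal sketch of the setup using the torchdiffeq package: the right-hand side of the clipper ODE is a small network driven by the input signal, and the solver integrates it over an audio frame. The toy input, network size, and time grid are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn
    from torchdiffeq import odeint

    class ClipperRHS(nn.Module):
        """dy/dt = f(y, u(t)), where u is the driving input signal."""
        def __init__(self, u):
            super().__init__()
            self.u = u
            self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

        def forward(self, t, y):
            # Condition the learned dynamics on the current input sample u(t).
            return self.net(torch.cat([y, self.u(t).reshape(1)], dim=-1))

    u = lambda t: torch.sin(2 * torch.pi * 440.0 * t)   # toy 440 Hz input
    rhs = ClipperRHS(u)
    t = torch.linspace(0.0, 0.01, 441)                  # 10 ms at 44.1 kHz
    y0 = torch.zeros(1)
    y = odeint(rhs, y0, t)                              # solution, shape [441, 1]
    ```

    Because the solver discretizes time only at the requested grid, the same learned right-hand side can be integrated on a denser grid after training, which is the sampling-rate flexibility the abstract highlights.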
    DeepFD: Automated Fault Diagnosis and Localization for Deep Learning Programs. (arXiv:2205.01938v1 [cs.SE])
    As Deep Learning (DL) systems are widely deployed for mission-critical applications, debugging such systems becomes essential. Most existing works identify and repair suspicious neurons on the trained Deep Neural Network (DNN), which, unfortunately, might be a detour. Specifically, several existing studies have reported that many unsatisfactory behaviors actually originate from faults residing in DL programs. Besides, locating faulty neurons is not actionable for developers, while locating the faulty statements in DL programs can provide developers with more useful information for debugging. Though a few recent studies were proposed to pinpoint the faulty statements in DL programs or the training settings (e.g., an overly large learning rate), they were mainly designed based on predefined rules, leading to many false alarms or false negatives, especially when the faults are beyond their capabilities. In view of these limitations, in this paper, we propose DeepFD, a learning-based fault diagnosis and localization framework which maps the fault localization task to a learning problem. In particular, it infers the suspicious fault types via monitoring the runtime features extracted during DNN model training and then locates the diagnosed faults in DL programs. It overcomes the limitations by identifying the root causes of faults in DL programs instead of neurons and diagnosing the faults by a learning approach instead of a set of hard-coded rules. The evaluation exhibits the potential of DeepFD. It correctly diagnoses 52% of faulty DL programs, roughly double the 27% achieved by the best state-of-the-art work. Besides, for fault localization, DeepFD also outperforms the existing works, correctly locating 42% of faulty programs, which almost doubles the best result (23%) achieved by the existing works.
    Dynamic Sparse R-CNN. (arXiv:2205.02101v1 [cs.CV])
    Sparse R-CNN is a recent strong object detection baseline by set prediction on sparse, learnable proposal boxes and proposal features. In this work, we propose to improve Sparse R-CNN with two dynamic designs. First, Sparse R-CNN adopts a one-to-one label assignment scheme, where the Hungarian algorithm is applied to match only one positive sample for each ground truth. Such one-to-one assignment may not be optimal for the matching between the learned proposal boxes and ground truths. To address this problem, we propose dynamic label assignment (DLA) based on the optimal transport algorithm to assign an increasing number of positive samples across the iterative training stages of Sparse R-CNN. We let the matching become gradually looser over successive stages, as later stages produce refined proposals with improved precision. Second, the learned proposal boxes and features remain fixed for different images in the inference process of Sparse R-CNN. Motivated by dynamic convolution, we propose dynamic proposal generation (DPG) to assemble multiple proposal experts dynamically, providing better initial proposal boxes and features for the consecutive training stages. DPG can thereby derive sample-dependent proposal boxes and features for inference. Experiments demonstrate that our method, named Dynamic Sparse R-CNN, can boost the strong Sparse R-CNN baseline with different backbones for object detection. Particularly, Dynamic Sparse R-CNN reaches the state-of-the-art 47.2% AP on the COCO 2017 validation set, surpassing Sparse R-CNN by 2.2% AP with the same ResNet-50 backbone.
    Understanding and Preventing Capacity Loss in Reinforcement Learning. (arXiv:2204.09560v2 [cs.LG] UPDATED)
    The reinforcement learning (RL) problem is rife with sources of non-stationarity, making it a notoriously difficult problem domain for the application of neural networks. We identify a mechanism by which non-stationary prediction targets can prevent learning progress in deep RL agents: \textit{capacity loss}, whereby networks trained on a sequence of target values lose their ability to quickly update their predictions over time. We demonstrate that capacity loss occurs in a range of RL agents and environments, and is particularly damaging to performance in sparse-reward tasks. We then present a simple regularizer, Initial Feature Regularization (InFeR), that mitigates this phenomenon by regressing a subspace of features towards its value at initialization, leading to significant performance improvements in sparse-reward environments such as Montezuma's Revenge. We conclude that preventing capacity loss is crucial to enable agents to maximally benefit from the learning signals they obtain throughout the entire training trajectory.
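    One way to realize such a regularizer in code (a sketch under assumptions: the head count, the frozen-snapshot mechanism, and all names are illustrative rather than the paper's released implementation):

    ```python
    import copy
    import torch
    import torch.nn as nn

    class InFeRPenalty(nn.Module):
        """Regress auxiliary head outputs toward their values at initialization."""
        def __init__(self, encoder, feature_dim, n_heads=10):
            super().__init__()
            self.encoder = encoder
            self.heads = nn.Linear(feature_dim, n_heads)
            # Frozen snapshot of encoder + heads, taken at initialization.
            self.init_encoder = copy.deepcopy(encoder)
            self.init_heads = copy.deepcopy(self.heads)
            for p in list(self.init_encoder.parameters()) + \
                     list(self.init_heads.parameters()):
                p.requires_grad = False

        def forward(self, obs):
            pred = self.heads(self.encoder(obs))
            with torch.no_grad():
                target = self.init_heads(self.init_encoder(obs))
            return ((pred - target) ** 2).mean()

    # Typical usage: total_loss = td_loss + beta * infer_penalty(obs)
    ```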
    ASE: Large-Scale Reusable Adversarial Skill Embeddings for Physically Simulated Characters. (arXiv:2205.01906v1 [cs.GR])
    The incredible feats of athleticism demonstrated by humans are made possible in part by a vast repertoire of general-purpose motor skills, acquired through years of practice and experience. These skills not only enable humans to perform complex tasks, but also provide powerful priors for guiding their behaviors when learning new tasks. This is in stark contrast to what is common practice in physics-based character animation, where control policies are most typically trained from scratch for each task. In this work, we present a large-scale data-driven framework for learning versatile and reusable skill embeddings for physically simulated characters. Our approach combines techniques from adversarial imitation learning and unsupervised reinforcement learning to develop skill embeddings that produce life-like behaviors, while also providing an easy to control representation for use on new downstream tasks. Our models can be trained using large datasets of unstructured motion clips, without requiring any task-specific annotation or segmentation of the motion data. By leveraging a massively parallel GPU-based simulator, we are able to train skill embeddings using over a decade of simulated experiences, enabling our model to learn a rich and versatile repertoire of skills. We show that a single pre-trained model can be effectively applied to perform a diverse set of new tasks. Our system also allows users to specify tasks through simple reward functions, and the skill embedding then enables the character to automatically synthesize complex and naturalistic strategies in order to achieve the task objectives.
    Explain and Conquer: Personalised Text-based Reviews to Achieve Transparency. (arXiv:2205.01759v1 [cs.LG])
    There are many contexts where dyadic data is present. Social networking is a well-known example, where transparency has grown in importance. In these contexts, pairs of items are linked, building a network where interactions play a crucial role. Explaining why these relationships are established is core to addressing transparency. These explanations are often presented using text, thanks to advances in natural language understanding. We have focused on the TripAdvisor platform, considering the applicability to other dyadic data contexts. The items are a subset of users and restaurants, and the interactions are the reviews posted by these users. Our aim is to represent and explain pairs (user, restaurant) established by agents (e.g., a recommender system or a paid promotion mechanism), so that personalisation is taken into account. We propose the PTER (Personalised TExt-based Reviews) model. We predict, from the available reviews for a given restaurant, those that fit the specific user interactions. PTER leverages the BERT (Bidirectional Encoders Representations from Transformers) language model. We customised a deep neural network following the feature-based approach. The performance metrics show the validity of our labelling proposal. We defined an evaluation framework based on a clustering process to assess our personalised representation. PTER clearly outperforms the proposed adversary in 5 of the 6 datasets, with a minimum ratio improvement of 4%.
    Graph Self-Supervised Learning: A Survey. (arXiv:2103.00111v5 [cs.LG] UPDATED)
    Deep learning on graphs has attracted significant interest recently. However, most of the works have focused on (semi-) supervised learning, resulting in shortcomings including heavy label reliance, poor generalization, and weak robustness. To address these issues, self-supervised learning (SSL), which extracts informative knowledge through well-designed pretext tasks without relying on manual labels, has become a promising and trending learning paradigm for graph data. Different from SSL on other domains like computer vision and natural language processing, SSL on graphs has a distinct background, design ideas, and taxonomies. Under the umbrella of graph self-supervised learning, we present a timely and comprehensive review of the existing approaches which employ SSL techniques for graph data. We construct a unified framework that mathematically formalizes the paradigm of graph SSL. According to the objectives of pretext tasks, we divide these approaches into four categories: generation-based, auxiliary property-based, contrast-based, and hybrid approaches. We further describe the applications of graph SSL across various research fields and summarize the commonly used datasets, evaluation benchmarks, performance comparisons, and open-source code of graph SSL. Finally, we discuss the remaining challenges and potential future directions in this research field.
    Efficient Few-Shot Fine-Tuning for Opinion Summarization. (arXiv:2205.02170v1 [cs.CL])
    Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples. However, in opinion summarization, large annotated datasets of reviews paired with reference summaries are not available and would be expensive to create. This calls for fine-tuning methods robust to overfitting on small datasets. In addition, generically pre-trained models are often not accustomed to the specifics of customer reviews and, after fine-tuning, yield summaries with disfluencies and semantic mistakes. To address these problems, we utilize an efficient few-shot method based on adapters which, as we show, can easily store in-domain knowledge. Instead of fine-tuning the entire model, we add adapters and pre-train them in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries. Then, we fine-tune the adapters on the small available human-annotated dataset. We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets, respectively. Finally, for summary personalization, we condition on aspect keyword queries, automatically created from generic datasets. In the same vein, we pre-train the adapters in a query-based manner on customer reviews and then fine-tune them on annotated datasets. This results in better-organized summary content reflected in improved coherence and fewer redundancies.
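    The adapter itself is a small bottleneck module inserted into an otherwise frozen transformer layer. A minimal sketch of the standard bottleneck design (the paper's exact adapter variant may differ):

    ```python
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, hidden_dim, bottleneck_dim=64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)  # compress
            self.up = nn.Linear(bottleneck_dim, hidden_dim)    # expand back
            self.act = nn.ReLU()

        def forward(self, x):
            # Residual connection: the adapter starts close to the identity,
            # so only a few parameters need to absorb in-domain knowledge.
            return x + self.up(self.act(self.down(x)))
    ```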
    Improved Orientation Estimation and Detection with Hybrid Object Detection Networks for Automotive Radar. (arXiv:2205.02111v1 [cs.CV])
    This paper presents novel hybrid architectures that combine grid- and point-based processing to improve the detection performance and orientation estimation of radar-based object detection networks. Purely grid-based detection models operate on a bird's-eye-view (BEV) projection of the input point cloud. These approaches suffer from a loss of detailed information through the discrete grid resolution. This applies in particular to radar object detection, where relatively coarse grid resolutions are commonly used to account for the sparsity of radar point clouds. In contrast, point-based models are not affected by this problem as they continuously process point clouds. However, they generally exhibit worse detection performances than grid-based methods. We show that a point-based model can extract neighborhood features, leveraging the exact relative positions of points, before grid rendering. This has significant benefits for a following convolutional detection backbone. In experiments on the public nuScenes dataset our hybrid architecture achieves improvements in terms of detection performance and orientation estimates over networks from previous literature.
    Leveraging Language to Learn Program Abstractions and Search Heuristics. (arXiv:2106.11053v3 [cs.LG] UPDATED)
    Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains -- string editing, image composition, and abstract reasoning about scenes -- even when no natural language hints are available at test time.
    State Representation Learning for Goal-Conditioned Reinforcement Learning. (arXiv:2205.01965v1 [cs.LG])
    This paper presents a novel state representation for reward-free Markov decision processes. The idea is to learn, in a self-supervised manner, an embedding space where distances between pairs of embedded states correspond to the minimum number of actions needed to transition between them. Compared to previous methods, our approach does not require any domain knowledge, learning from offline and unlabeled data. We show how this representation can be leveraged to learn goal-conditioned policies, providing a notion of similarity between states and goals and a useful heuristic distance to guide planning and reinforcement learning algorithms. Finally, we empirically validate our method in classic control domains and multi-goal environments, demonstrating that our method can successfully learn representations in large and/or continuous domains.
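    As one concrete illustration of the idea (an assumption, not the paper's exact objective), such an embedding can be trained by regressing distances between embedded states onto the number of actions separating them in logged trajectories, which upper-bounds the true minimum action distance:

    ```python
    import torch
    import torch.nn as nn

    def distance_regression_loss(encoder: nn.Module, s_i, s_j, k):
        """s_i, s_j: batches of states observed k actions apart in trajectories."""
        d = (encoder(s_i) - encoder(s_j)).norm(dim=-1)   # embedded distance
        return ((d - k.float()) ** 2).mean()             # match action counts
    ```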
    Few-Shot Backdoor Attacks on Visual Object Tracking. (arXiv:2201.13178v2 [cs.CV] UPDATED)
    Visual object tracking (VOT) has been widely adopted in mission-critical applications, such as autonomous driving and intelligent surveillance systems. In current practice, third-party resources such as datasets, backbone networks, and training platforms are frequently used to train high-performance VOT models. Whilst these resources bring certain convenience, they also introduce new security threats into VOT models. In this paper, we reveal such a threat where an adversary can easily implant hidden backdoors into VOT models by tampering with the training process. Specifically, we propose a simple yet effective few-shot backdoor attack (FSBA) that optimizes two losses alternately: 1) a \emph{feature loss} defined in the hidden feature space, and 2) the standard \emph{tracking loss}. We show that, once the backdoor is embedded into the target model by our FSBA, it can trick the model to lose track of specific objects even when the \emph{trigger} only appears in one or a few frames. We examine our attack in both digital and physical-world settings and show that it can significantly degrade the performance of state-of-the-art VOT trackers. We also show that our attack is resistant to potential defenses, highlighting the vulnerability of VOT models to potential backdoor attacks.
    Zero-shot Sonnet Generation with Discourse-level Planning and Aesthetics Features. (arXiv:2205.01821v1 [cs.CL])
    Poetry generation, and creative language generation in general, usually suffers from the lack of large training data. In this paper, we present a novel framework to generate sonnets that does not require training on poems. We design a hierarchical framework which plans the poem sketch before decoding. Specifically, a content planning module is trained on non-poetic texts to obtain discourse-level coherence; then a rhyme module generates rhyme words and a polishing module introduces imagery and similes for aesthetics purposes. Finally, we design a constrained decoding algorithm to impose the meter-and-rhyme constraint of the generated sonnets. Automatic and human evaluation show that our multi-stage approach without training on poem corpora generates more coherent, poetic, and creative sonnets than several strong baselines.
    Learning Purified Feature Representations from Task-irrelevant Labels. (arXiv:2102.10955v2 [cs.LG] UPDATED)
    Learning an empirically effective model that generalizes from limited data is a challenging task for deep neural networks. In this paper, we propose a novel learning framework called PurifiedLearning to exploit task-irrelevant features extracted from task-irrelevant labels when training models on small-scale datasets. Particularly, we purify feature representations by using the expression of task-irrelevant information, thus facilitating the learning process of classification. Our work is built on solid theoretical analysis and extensive experiments, which demonstrate the effectiveness of PurifiedLearning. According to the theory we proved, PurifiedLearning is model-agnostic and imposes no restrictions on the underlying model, so it can easily be combined with any existing deep neural network to achieve better performance. The source code of this paper will be made available in the future for reproducibility.
    Data Cleansing for Indoor Positioning Wi-Fi Fingerprinting Datasets. (arXiv:2205.02096v1 [eess.SP])
    Wearable and IoT devices requiring positioning and localisation services grow in number exponentially every year. This rapid growth also produces millions of data entries that need to be pre-processed prior to being used in any indoor positioning system to ensure the data quality and provide a high Quality of Service (QoS) to the end-user. In this paper, we offer a novel and straightforward data cleansing algorithm for WLAN fingerprinting radio maps. This algorithm is based on the correlation among fingerprints using the Received Signal Strength (RSS) values and the Access Points' (APs') identifiers. We use those to compute the correlation among all samples in the dataset and remove fingerprints with a low level of correlation from the dataset. We evaluated the proposed method on 14 independent publicly-available datasets. As a result, an average of 14% of fingerprints were removed from the datasets. The 2D positioning error was reduced by 2.7% and the 3D positioning error by 5.3%, with a slight increase in the floor hit rate by 1.2% on average. Consequently, the average speed of position prediction was also increased by 14%.
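    A minimal sketch of the correlation-based filtering idea, assuming a simple Pearson correlation over imputed RSS vectors and an illustrative threshold (the paper's exact correlation measure and cutoff may differ):

    ```python
    import numpy as np

    def cleanse(rss, threshold=0.5):
        """rss: [n_samples, n_aps] matrix of RSS values (missing APs imputed)."""
        corr = np.corrcoef(rss)              # sample-to-sample correlation matrix
        np.fill_diagonal(corr, -np.inf)      # ignore self-correlation
        best = corr.max(axis=1)              # strongest match for each sample
        keep = best >= threshold             # drop poorly correlated fingerprints
        return rss[keep], keep
    ```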
    Hypercomplex Image-to-Image Translation. (arXiv:2205.02087v1 [cs.CV])
    Image-to-image translation (I2I) aims at transferring the content representation from an input domain to an output one, bouncing along different target domains. Recent I2I generative models, which gain outstanding results in this task, comprise a set of diverse deep networks each with tens of millions of parameters. Moreover, images are usually three-dimensional, being composed of RGB channels, and common neural models do not take correlations among dimensions into account, losing beneficial information. In this paper, we propose to leverage hypercomplex algebra properties to define lightweight I2I generative models capable of preserving pre-existing relations among image dimensions, thus exploiting additional input information. On manifold I2I benchmarks, we show how the proposed Quaternion StarGANv2 and parameterized hypercomplex StarGANv2 (PHStarGANv2) reduce parameters and storage memory amount while ensuring high domain translation performance and good image quality as measured by FID and LPIPS scores. Full code is available at: https://github.com/ispamm/HI2I.
    BERTMap: A BERT-based Ontology Alignment System. (arXiv:2112.02682v4 [cs.AI] UPDATED)
    Ontology alignment (a.k.a. ontology matching (OM)) plays a critical role in knowledge integration. Owing to the success of machine learning in many domains, it has been applied to OM. However, the existing methods, which often adopt ad-hoc feature engineering or non-contextual word embeddings, have not yet outperformed rule-based systems, especially in an unsupervised setting. In this paper, we propose a novel OM system named BERTMap which can support both unsupervised and semi-supervised settings. It first predicts mappings using a classifier based on fine-tuning the contextual embedding model BERT on text semantics corpora extracted from ontologies, and then refines the mappings through extension and repair by utilizing the ontology structure and logic. Our evaluation with three alignment tasks on biomedical ontologies demonstrates that BERTMap can often perform better than the leading OM systems LogMap and AML.
    Few-Shot Document-Level Relation Extraction. (arXiv:2205.02048v1 [cs.CL])
    We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and construct a benchmark based on two existing supervised learning data sets, DocRED and sciERC. We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation. We find FSDLRE to be a challenging setting with interesting new characteristics such as the ability to sample NOTA instances from the support set. The data, code, and trained models are available online (https://github.com/nicpopovic/FREDo).
    Using Deep Reinforcement Learning to solve Optimal Power Flow problem with generator failures. (arXiv:2205.02108v1 [cs.LG])
    Deep Reinforcement Learning (DRL) is being used in many domains. One of the biggest advantages of DRL is that it enables the continuous improvement of a learning agent. Secondly, the DRL framework is robust and flexible enough to be applicable to problems of varying nature and domain. The presented work demonstrates the use of the DRL technique to solve an Optimal Power Flow (OPF) problem. Two classical algorithms have been presented to solve the OPF problem. The drawbacks of the vanilla DRL application are discussed, and an algorithm is suggested to improve the performance. Furthermore, a reward function for the OPF problem is presented that enables the solution of inherent issues in DRL. Reasons for divergence and degeneration in DRL are discussed, and the correct strategy to deal with them with respect to OPF is presented.
    The Limits of Word Level Differential Privacy. (arXiv:2205.02130v1 [cs.CR])
    As the issues of privacy and trust are receiving increasing attention within the research community, various attempts have been made to anonymize textual data. A significant subset of these approaches incorporate differentially private mechanisms to perturb word embeddings, thus replacing individual words in a sentence. While these methods represent very important contributions, have various advantages over other techniques and do show anonymization capabilities, they have several shortcomings. In this paper, we investigate these weaknesses and demonstrate significant mathematical constraints diminishing the theoretical privacy guarantee as well as major practical shortcomings with regard to the protection against deanonymization attacks, the preservation of content of the original sentences as well as the quality of the language output. Finally, we propose a new method for text anonymization based on transformer based language models fine-tuned for paraphrasing that circumvents most of the identified weaknesses and also offers a formal privacy guarantee. We evaluate the performance of our method via thorough experimentation and demonstrate superior performance over the discussed mechanisms.
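    The class of mechanisms under analysis can be sketched as follows: perturb a word's embedding with calibrated noise and emit the nearest vocabulary word. The noise model below is a simplification for illustration, not any specific published mechanism:

    ```python
    import numpy as np

    def dp_replace(word, embeddings, vocab, epsilon=10.0):
        """embeddings: [V, d] matrix; vocab: list of the V words."""
        v = embeddings[vocab.index(word)]
        # Simplified noise; real mechanisms sample multivariate noise whose
        # scale is calibrated to the metric-DP guarantee.
        noisy = v + np.random.laplace(scale=1.0 / epsilon, size=v.shape)
        dists = np.linalg.norm(embeddings - noisy, axis=1)
        return vocab[int(dists.argmin())]    # nearest-neighbour decoding
    ```

    The paper's critique targets exactly this pipeline: the nearest-neighbour decoding step and the noise calibration interact in ways that weaken both the formal guarantee and the output quality.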
    The Isabelle ENIGMA. (arXiv:2205.01981v1 [cs.AI])
    We significantly improve the performance of the E automated theorem prover on the Isabelle Sledgehammer problems by combining learning and theorem proving in several ways. In particular, we develop targeted versions of the ENIGMA guidance for the Isabelle problems, targeted versions of neural premise selection, and targeted strategies for E. The methods are trained in several iterations over hundreds of thousands of untyped and typed first-order problems extracted from Isabelle. Our final best single-strategy ENIGMA and premise selection system improves the best previous version of E by 25.3% in 15 seconds, also outperforming all other previous ATP and SMT systems.
    On Continual Model Refinement in Out-of-Distribution Data Streams. (arXiv:2205.02014v1 [cs.CL])
    Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams while overcoming catastrophic forgetting. However, existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario. In response to this, we propose a new CL problem formulation dubbed continual model refinement (CMR). Compared to prior CL settings, CMR is more practical and introduces unique challenges (boundary-agnostic and non-stationary distribution shift, diverse mixtures of multiple OOD data clusters, error-centric streams, etc.). We extend several existing CL approaches to the CMR setting and evaluate them extensively. For benchmarking and analysis, we propose a general sampling algorithm to obtain dynamic OOD data streams with controllable non-stationarity, as well as a suite of metrics measuring various aspects of online performance. Our experiments and detailed analysis reveal the promise and challenges of the CMR problem, supporting that studying CMR in dynamic OOD streams can benefit the longevity of deployed NLP models in production.
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v1 [cs.LG])
    We propose predictive sampling as an approach to selecting actions that balance between exploration and exploitation in nonstationary bandit environments. When specialized to stationary environments, predictive sampling is equivalent to Thompson sampling. However, predictive sampling is effective across a range of nonstationary environments in which Thompson sampling suffers. We establish a general information-theoretic bound on the Bayesian regret of predictive sampling. We then specialize this bound to study a modulated Bernoulli bandit environment. Our analysis highlights a key advantage of predictive sampling over Thompson sampling: predictive sampling deprioritizes investments in exploration where acquired information will quickly become less relevant.
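    In the stationary Bernoulli case, where predictive sampling reduces to Thompson sampling, the procedure is easy to state concretely (a toy sketch; the nonstationary machinery of predictive sampling is not shown):

    ```python
    import numpy as np

    def thompson_step(successes, failures, rng):
        samples = rng.beta(successes + 1, failures + 1)  # draw from Beta posteriors
        return int(samples.argmax())                     # play the best sampled arm

    rng = np.random.default_rng(0)
    true_p = np.array([0.2, 0.5, 0.7])
    successes = np.zeros(3)
    failures = np.zeros(3)
    for t in range(1000):
        arm = thompson_step(successes, failures, rng)
        reward = float(rng.random() < true_p[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
    ```

    In a nonstationary environment these posterior counts keep sharpening even as the underlying means drift, which is precisely the failure mode predictive sampling is designed to avoid by deprioritizing information that will soon become stale.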
    Concept Activation Vectors for Generating User-Defined 3D Shapes. (arXiv:2205.02102v1 [cs.CV])
    We explore the interpretability of 3D geometric deep learning models in the context of Computer-Aided Design (CAD). The field of parametric CAD can be limited by the difficulty of expressing high-level design concepts in terms of a few numeric parameters. In this paper, we use a deep learning architecture to encode high dimensional 3D shapes into a vectorized latent representation that can be used to describe arbitrary concepts. Specifically, we train a simple auto-encoder to parameterize a dataset of complex shapes. To understand the latent encoded space, we use the idea of Concept Activation Vectors (CAV) to reinterpret the latent space in terms of user-defined concepts. This allows modification of a reference design to exhibit more or fewer characteristics of a chosen concept or group of concepts. We also test the statistical significance of the identified concepts and determine the sensitivity of a physical quantity of interest across the dataset.
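    Computing a CAV follows the recipe from the original CAV literature: fit a linear classifier separating latent codes of concept examples from random examples, and take the normal to its decision boundary. A minimal sketch (function and variable names are illustrative):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def compute_cav(concept_latents, random_latents):
        X = np.vstack([concept_latents, random_latents])
        y = np.concatenate([np.ones(len(concept_latents)),
                            np.zeros(len(random_latents))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        cav = clf.coef_[0]                     # normal to the decision boundary
        return cav / np.linalg.norm(cav)

    # A reference design's latent code z can then be nudged along the concept:
    #   z_new = z + alpha * cav
    ```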
    Pre-RTL DNN Hardware Evaluator With Fused Layer Support. (arXiv:2205.01729v1 [cs.AR])
    With the popularity of the deep neural network (DNN), hardware accelerators are demanded for real-time execution. However, the lengthy design process and fast-evolving DNN models make it hard for hardware evaluation to meet time-to-market needs. This paper proposes a pre-RTL DNN hardware evaluator that supports conventional layer-by-layer processing as well as fused layer processing for a low external bandwidth requirement. The evaluator supports two state-of-the-art accelerator architectures and finds the best hardware configuration and layer fusion grouping. The experimental results show the layer fusion scheme can achieve 55.6% memory bandwidth reduction, 36.7% latency improvement and 49.2% energy reduction compared with layer-by-layer operation.
    Recurrent Flow Networks: A Recurrent Latent Variable Model for Density Modelling of Urban Mobility. (arXiv:2006.05256v2 [stat.ML] UPDATED)
    Mobility-on-demand (MoD) systems represent a rapidly developing mode of transportation wherein travel requests are dynamically handled by a coordinated fleet of vehicles. Crucially, the efficiency of an MoD system highly depends on how well supply and demand distributions are aligned in spatio-temporal space (i.e., to satisfy user demand, cars have to be available in the correct place and at the desired time). To do so, we argue that predictive models should aim to explicitly disentangle between temporal and spatial variability in the evolution of urban mobility demand. However, current approaches typically ignore this distinction by either treating both sources of variability jointly, or completely ignoring their presence in the first place. In this paper, we propose recurrent flow networks (RFN), where we explore the inclusion of (i) latent random variables in the hidden state of recurrent neural networks to model temporal variability, and (ii) normalizing flows to model the spatial distribution of mobility demand. We demonstrate how predictive models explicitly disentangling between spatial and temporal variability exhibit several desirable properties, and empirically show how this enables the generation of distributions matching potentially complex urban topologies.
    Palette: Image-to-Image Diffusion Models. (arXiv:2111.05826v2 [cs.CV] UPDATED)
    This paper develops a unified framework for image-to-image translation based on conditional diffusion models and evaluates this framework on four challenging image-to-image translation tasks, namely colorization, inpainting, uncropping, and JPEG restoration. Our simple implementation of image-to-image diffusion models outperforms strong GAN and regression baselines on all tasks, without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss or sophisticated new techniques needed. We uncover the impact of an L2 vs. L1 loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention in the neural architecture through empirical studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, with human evaluation and sample quality scores (FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against original images). We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research. Finally, we show that a generalist, multi-task diffusion model performs as well or better than task-specific specialist counterparts. Check out https://diffusion-palette.github.io for an overview of the results.
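    The L1-vs-L2 comparison mentioned above concerns the penalty applied to the noise prediction in the conditional denoising objective. A toy sketch with an ad-hoc noise schedule (`model` is an assumed conditional UNet, not the paper's architecture):

    ```python
    import torch
    import torch.nn.functional as F

    def diffusion_loss(model, x_src, x_tgt, loss_type="l2"):
        t = torch.rand(x_tgt.size(0), device=x_tgt.device)  # per-example noise level
        noise = torch.randn_like(x_tgt)
        alpha = (1.0 - t).view(-1, 1, 1, 1)                 # ad-hoc toy schedule
        x_noisy = alpha.sqrt() * x_tgt + (1.0 - alpha).sqrt() * noise
        pred = model(x_src, x_noisy, t)                     # predict the added noise
        if loss_type == "l1":
            return F.l1_loss(pred, noise)   # the paper studies L1 vs. L2 here
        return F.mse_loss(pred, noise)
    ```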
    Machine Learning based Framework for Robust Price-Sensitivity Estimation with Application to Airline Pricing. (arXiv:2205.01875v1 [stat.ML])
    We consider the problem of dynamic pricing of a product in the presence of feature-dependent price sensitivity. Based on the Poisson semi-parametric approach, we construct a flexible yet interpretable demand model where the price related part is parametric while the remaining (nuisance) part of the model is non-parametric and can be modeled via sophisticated ML techniques. The estimation of price-sensitivity parameters of this model via direct one-stage regression techniques may lead to biased estimates. We propose a two-stage estimation methodology which makes the estimation of the price-sensitivity parameters robust to biases in the nuisance parameters of the model. In the first stage, we construct the estimators of observed purchases and price given the feature vector using sophisticated ML estimators like deep neural networks. Utilizing the estimators from the first stage, in the second stage we leverage a Bayesian dynamic generalized linear model to estimate the price-sensitivity parameters. We test the performance of the proposed estimation schemes on simulated and real sales transaction data from the airline industry. Our numerical studies demonstrate that the two-stage approach provides more accurate estimates of price-sensitivity parameters as compared to the direct one-stage approach.
    Self-supervised learning unveils morphological clusters behind lung cancer types and prognosis. (arXiv:2205.01931v1 [cs.CV])
    Histopathological images of tumors contain abundant information about how tumors grow and how they interact with their micro-environment. Characterizing and improving our understanding of phenotypes could reveal factors related to tumor progression and their underpinning biological processes, ultimately improving diagnosis and treatment. In recent years, the field of histological deep learning applications has seen great progress, yet most of these applications focus on a supervised approach, relating tissue and associated sample annotations. Supervised approaches have their impact limited by two factors. Firstly, high-quality labels are expensive in time and effort, which makes them not easily scalable. Secondly, these methods focus on predicting annotations from histological images, fundamentally restricting the discovery of new tissue phenotypes. These limitations emphasize the importance of using new methods that can characterize tissue by the features enclosed in the image, without pre-defined annotation or supervision. We present Phenotype Representation Learning (PRL), a methodology to extract histomorphological phenotypes through self-supervised learning and community detection. PRL creates phenotype clusters by identifying tissue patterns that share common morphological and cellular features, allowing whole slide images to be described through compositional representations of cluster contributions. We used this framework to analyze histopathology slides of LUAD and LUSC lung cancer subtypes from TCGA and NYU cohorts. We show that PRL achieves robust lung subtype prediction, providing statistically relevant phenotypes for each lung subtype. We further demonstrate the significance of these phenotypes in lung adenocarcinoma overall and recurrence-free survival, relating clusters with patient outcomes, cell types, growth patterns, and omic-based immune signatures.
    XLTime: A Cross-Lingual Knowledge Transfer Framework for Temporal Expression Extraction. (arXiv:2205.01757v1 [cs.CL])
    Temporal Expression Extraction (TEE) is essential for understanding time in natural language. It has applications in Natural Language Processing (NLP) tasks such as question answering, information retrieval, and causal inference. To date, work in this area has mostly focused on English as there is a scarcity of labeled data for other languages. We propose XLTime, a novel framework for multilingual TEE. XLTime works on top of pre-trained language models and leverages multi-task learning to prompt cross-language knowledge transfer both from English and within the non-English languages. XLTime alleviates problems caused by a shortage of data in the target language. We apply XLTime with different language models and show that it outperforms the previous automatic SOTA methods on French, Spanish, Portuguese, and Basque, by large margins. XLTime also closes the gap considerably on the handcrafted HeidelTime method.
    Discrete Simulation Optimization for Tuning Machine Learning Method Hyperparameters. (arXiv:2201.05978v2 [cs.LG] UPDATED)
    Machine learning (ML) methods are used in most technical areas such as image recognition, product recommendation, financial analysis, medical diagnosis, and predictive maintenance. An important aspect of implementing ML methods involves controlling the learning process for the ML method so as to maximize the performance of the method under consideration. Hyperparameter tuning is the process of selecting a suitable set of ML method parameters that control its learning process. In this work, we demonstrate the use of discrete simulation optimization methods such as ranking and selection (R&S) and random search for identifying a hyperparameter set that maximizes the performance of a ML method. Specifically, we use the KN R&S method and the stochastic ruler random search method and one of its variations for this purpose. We also construct the theoretical basis for applying the KN method, which determines the optimal solution with a statistical guarantee via solution space enumeration. In comparison, the stochastic ruler method asymptotically converges to global optima and incurs smaller computational overheads. We demonstrate the application of these methods to a wide variety of machine learning models, including deep neural network models used for time series prediction and image classification. We benchmark our application of these methods with state-of-the-art hyperparameter optimization libraries such as $hyperopt$ and $mango$. The KN method consistently outperforms $hyperopt$'s random search (RS) and Tree of Parzen Estimators (TPE) methods. The stochastic ruler method outperforms the $hyperopt$ RS method and offers statistically comparable performance with respect to $hyperopt$'s TPE method and the $mango$ algorithm.
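    For readers unfamiliar with the stochastic ruler method, its acceptance test can be sketched as follows; the neighbourhood sampler, test-count schedule, and ruler range (a, b) are illustrative assumptions rather than the paper's exact configuration:

    ```python
    import random

    def stochastic_ruler(x0, f, neighbours, a, b, iters=200):
        """Maximize a noisy objective f over a finite set via the stochastic ruler.

        neighbours(x) returns candidate moves; (a, b) bounds the objective range.
        """
        x = x0
        for k in range(iters):
            z = random.choice(neighbours(x))   # sample a candidate configuration
            tests = k // 20 + 1                # slowly growing number of tests
            # Accept z only if fresh noisy evaluations beat uniform "ruler" draws.
            if all(f(z) >= random.uniform(a, b) for _ in range(tests)):
                x = z
        return x
    ```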
    Growing Isotropic Neural Cellular Automata. (arXiv:2205.01681v1 [cs.NE])
    Modeling the ability of multicellular organisms to build and maintain their bodies through local interactions between individual cells (morphogenesis) is a long-standing challenge of developmental biology. Recently, the Neural Cellular Automata (NCA) model was proposed as a way to find local system rules that produce a desired global behaviour, such as growing and persisting a predefined pattern, by repeatedly applying the same rule over a grid starting from a single cell. In this work we argue that the original Growing NCA model has an important limitation: anisotropy of the learned update rule. This implies the presence of an external factor that orients the cells in a particular direction. In other words, 'physical' rules of the underlying system are not invariant to rotation, thus prohibiting the existence of differently oriented instances of the target pattern on the same grid. We propose a modified Isotropic NCA model that does not have this limitation. We demonstrate that cell systems can be trained to grow accurate asymmetrical patterns through either of two methods: by breaking symmetries using structured seeds; or by introducing a rotation-reflection invariant training objective and relying on symmetry breaking caused by asynchronous cell updates.
    fairlib: A Unified Framework for Assessing and Improving Classification Fairness. (arXiv:2205.01876v1 [cs.LG])
    This paper presents fairlib, an open-source framework for assessing and improving classification fairness. It provides a systematic framework for quickly reproducing existing baseline models, developing new methods, evaluating models with different metrics, and visualizing their results. Its modularity and extensibility enable the framework to be used for diverse types of inputs, including natural language, images, and audio. In detail, we implement 14 debiasing methods, including pre-processing, at-training-time, and post-processing approaches. The built-in metrics cover the most commonly used fairness criteria and can be further generalized and customized for fairness evaluation.
    Self-focusing virtual screening with active design space pruning. (arXiv:2205.01753v1 [q-bio.QM])
    High-throughput virtual screening is an indispensable technique utilized in the discovery of small molecules. In cases where the library of molecules is exceedingly large, the cost of an exhaustive virtual screen may be prohibitive. Model-guided optimization has been employed to lower these costs through dramatic increases in sample efficiency compared to random selection. However, these techniques introduce new costs to the workflow through the surrogate model training and inference steps. In this study, we propose an extension to the framework of model-guided optimization that mitigates inference costs using a technique we refer to as design space pruning (DSP), which irreversibly removes poor-performing candidates from consideration. We study the application of DSP to a variety of optimization tasks and observe significant reductions in overhead costs while exhibiting similar performance to the baseline optimization. DSP represents an attractive extension of model-guided optimization that can limit overhead costs in optimization settings where these costs are non-negligible relative to objective costs, such as docking.
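    A hedged sketch of the screening loop with design space pruning: a surrogate (here a random forest, an assumption of this sketch) scores the remaining pool, the most promising candidates go to the expensive objective, and the worst-scored fraction is removed irreversibly.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def screen_with_dsp(pool, oracle, n_rounds=5, batch=16, prune_frac=0.5, seed=0):
            # pool: (N, d) candidate features; oracle(i): expensive objective (higher is better).
            rng = np.random.default_rng(seed)
            alive = np.arange(len(pool))          # candidates still in the design space
            X, y = [], []
            for r in range(n_rounds):
                if r == 0:
                    picks = rng.choice(alive, size=batch, replace=False)
                else:
                    surrogate = RandomForestRegressor(n_estimators=100, random_state=seed)
                    surrogate.fit(np.asarray(X), np.asarray(y))
                    order = np.argsort(surrogate.predict(pool[alive]))   # ascending score
                    picks = alive[order[-batch:]]                        # acquire the best-scored
                    alive = alive[order[int(prune_frac * len(alive)):]]  # DSP: drop the worst for good
                for i in picks:
                    X.append(pool[i]); y.append(oracle(i))
            best = int(np.argmax(y))
            return X[best], y[best]

    Pruned candidates are never scored again, which is what removes the recurring surrogate-inference cost over the full library.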
    Towards Theoretical Analysis of Transformation Complexity of ReLU DNNs. (arXiv:2205.01940v1 [cs.LG])
    This paper aims to theoretically analyze the complexity of feature transformations encoded in DNNs with ReLU layers. We propose metrics to measure three types of complexities of transformations based on the information theory. We further discover and prove the strong correlation between the complexity and the disentanglement of transformations. Based on the proposed metrics, we analyze two typical phenomena of the change of the transformation complexity during the training process, and explore the ceiling of a DNN's complexity. The proposed metrics can also be used as a loss to learn a DNN with the minimum complexity, which also controls the over-fitting level of the DNN and influences adversarial robustness, adversarial transferability, and knowledge consistency. Comprehensive comparative studies have provided new perspectives to understand the DNN.
    EllSeg: An Ellipse Segmentation Framework for Robust Gaze Tracking. (arXiv:2007.09600v2 [cs.CV] UPDATED)
    Ellipse fitting, an essential component in pupil or iris tracking based video oculography, is performed on previously segmented eye parts generated using various computer vision techniques. Several factors, such as occlusions due to eyelid shape, camera position or eyelashes, frequently break ellipse fitting algorithms that rely on well-defined pupil or iris edge segments. In this work, we propose training a convolutional neural network to directly segment entire elliptical structures and demonstrate that such a framework is robust to occlusions and offers superior pupil and iris tracking performance (at least 10% and 24% increases in pupil and iris center detection rate, respectively, within a two-pixel error margin) compared to using standard eye parts segmentation for multiple publicly available synthetic segmentation datasets.
    Self-Taught Metric Learning without Labels. (arXiv:2205.01903v1 [cs.CV])
    We present a novel self-taught framework for unsupervised metric learning, which alternates between predicting class-equivalence relations between data through a moving average of an embedding model and learning the model with the predicted relations as pseudo labels. At the heart of our framework lies an algorithm that investigates contexts of data on the embedding space to predict their class-equivalence relations as pseudo labels. The algorithm enables efficient end-to-end training since it demands no off-the-shelf module for pseudo labeling. Also, the class-equivalence relations provide rich supervisory signals for learning an embedding space. On standard benchmarks for metric learning, it clearly outperforms existing unsupervised learning methods and sometimes even beats supervised learning models using the same backbone network. It is also applied to semi-supervised metric learning as a way of exploiting additional unlabeled data, and achieves the state of the art by boosting performance of supervised learning substantially.
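    A minimal PyTorch sketch of the two alternating ingredients described above: a teacher kept as a moving average of the student, and pairwise class-equivalence pseudo labels derived from teacher similarities (the similarity threshold and temperature are illustrative assumptions, not the paper's exact pseudo-labeling rule).

        import torch
        import torch.nn.functional as F

        def ema_update(teacher, student, m=0.999):
            # Teacher weights track a moving average of the student.
            with torch.no_grad():
                for pt, ps in zip(teacher.parameters(), student.parameters()):
                    pt.mul_(m).add_(ps, alpha=1 - m)

        def pseudo_relation_loss(student, teacher, x, thresh=0.8, temp=0.1):
            # Teacher embeddings define pairwise "same class / different class" pseudo
            # labels; the student is trained to reproduce those relations.
            with torch.no_grad():
                zt = F.normalize(teacher(x), dim=1)
                pseudo = (zt @ zt.t() > thresh).float()
            zs = F.normalize(student(x), dim=1)
            return F.binary_cross_entropy_with_logits(zs @ zs.t() / temp, pseudo)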
    Provably Confidential Language Modelling. (arXiv:2205.01863v1 [cs.CL])
    Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter these privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
    Differentiable Simulation of Soft Multi-body Systems. (arXiv:2205.01758v1 [cs.LG])
    We present a method for differentiable simulation of soft articulated bodies. Our work enables the integration of differentiable physical dynamics into gradient-based pipelines. We develop a top-down matrix assembly algorithm within Projective Dynamics and derive a generalized dry friction model for soft continuum using a new matrix splitting strategy. We derive a differentiable control framework for soft articulated bodies driven by muscles, joint torques, or pneumatic tubes. The experiments demonstrate that our designs make soft body simulation more stable and realistic compared to other frameworks. Our method accelerates the solution of system identification problems by more than an order of magnitude, and enables efficient gradient-based learning of motion control with soft robots.
    DeeptDCS: Deep Learning-Based Estimation of Currents Induced During Transcranial Direct Current Stimulation. (arXiv:2205.01858v1 [q-bio.QM])
    Objective: Transcranial direct current stimulation (tDCS) is a non-invasive brain stimulation technique used to generate conduction currents in the head and disrupt brain functions. To rapidly evaluate the tDCS-induced current density in near real-time, this paper proposes a deep learning-based emulator, named DeeptDCS. Methods: The emulator leverages Attention U-net taking the volume conductor models (VCMs) of head tissues as inputs and outputting the three-dimensional current density distribution across the entire head. The electrode configurations are also incorporated into VCMs without increasing the number of input channels; this enables the straightforward incorporation of the non-parametric features of electrodes (e.g., thickness, shape, size, and position) in the training and testing of the proposed emulator. Results: Attention U-net outperforms standard U-net and its other three variants (Residual U-net, Attention Residual U-net, and Multi-scale Residual U-net) in terms of accuracy. The generalization ability of DeeptDCS to non-trained electrode positions can be greatly enhanced through fine-tuning the model. The computational time required by one emulation via DeeptDCS is a fraction of a second. Conclusion: DeeptDCS is at least two orders of magnitude faster than a physics-based open-source simulator, while providing satisfactorily accurate results. Significance: The high computational efficiency permits the use of DeeptDCS in applications requiring its repetitive execution, such as uncertainty quantification and optimization studies of tDCS.
    Uncertainty estimation of pedestrian future trajectory using Bayesian approximation. (arXiv:2205.01887v1 [cs.LG])
    Past research on pedestrian trajectory forecasting has mainly focused on deterministic predictions, which provide only point estimates of future states. These future estimates can help an autonomous vehicle plan its trajectory and avoid collision. However, under dynamic traffic scenarios, planning based on deterministic predictions is not trustworthy. Rather, estimating the uncertainty associated with the predicted states with a certain level of confidence can lead to robust path planning. Hence, the authors propose using stochastic approximation to quantify, during forecasting, the uncertainty that deterministic approaches fail to capture. The method is simple and applies Bayesian approximation during inference to standard neural network architectures for estimating uncertainty. The authors compared the predictions of the probabilistic neural network (NN) models with those of the standard deterministic models. The results indicate that the mean predicted path of the probabilistic models was closer to the ground truth than the deterministic prediction. Further, the effect of stochastic dropout of weights and long-term prediction on future state uncertainty has been studied. It was found that the probabilistic models produced better values of performance metrics such as average displacement error (ADE) and final displacement error (FDE). Finally, the study has been extended to multiple datasets, providing a comprehensive comparison for each model.
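    The "stochastic dropout of weights" above suggests Monte Carlo dropout as the Bayesian approximation; a minimal PyTorch sketch, assuming the model contains dropout layers:

        import torch

        def mc_dropout_forecast(model, x, n_samples=50):
            # Keep stochastic layers active at inference; in a careful implementation only
            # the dropout modules would be switched to train mode, not e.g. batch norm.
            model.train()
            with torch.no_grad():
                preds = torch.stack([model(x) for _ in range(n_samples)])
            return preds.mean(0), preds.std(0)   # point forecast and per-coordinate uncertainty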
    Crystal Twins: Self-supervised Learning for Crystalline Material Property Prediction. (arXiv:2205.01893v1 [cs.LG])
    Machine learning (ML) models have been widely successful in the prediction of material properties. However, large labeled datasets required for training accurate ML models are elusive and computationally expensive to generate. Recent advances in Self-Supervised Learning (SSL) frameworks capable of training ML models on unlabeled data have mitigated this problem and demonstrated superior performance in computer vision and natural language processing tasks. Drawing inspiration from the developments in SSL, we introduce Crystal Twins (CT): an SSL method for crystalline materials property prediction. Using a large unlabeled dataset, we pre-train a Graph Neural Network (GNN) by applying the redundancy reduction principle to the graph latent embeddings of augmented instances obtained from the same crystalline system. By sharing the pre-trained weights when fine-tuning the GNN for regression tasks, we significantly improve the performance on 7 challenging material property prediction benchmarks.
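    The redundancy reduction principle mentioned above is, in Barlow-Twins style, a cross-correlation objective on the embeddings of two augmented views; a hedged PyTorch sketch follows (the normalization and weighting details are generic choices, not necessarily CT's exact ones).

        import torch

        def redundancy_reduction_loss(z1, z2, lam=5e-3):
            # z1, z2: (B, D) graph latent embeddings of two augmentations of the same crystal.
            z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
            z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
            c = (z1.t() @ z2) / z1.shape[0]                 # D x D cross-correlation matrix
            on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # invariance: diagonal -> 1
            off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelation
            return on_diag + lam * off_diag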
    Branch & Learn for Recursively and Iteratively Solvable Problems in Predict+Optimize. (arXiv:2205.01672v1 [cs.LG])
    This paper proposes Branch & Learn, a framework for Predict+Optimize to tackle optimization problems containing parameters that are unknown at the time of solving. Given an optimization problem solvable by a recursive algorithm satisfying simple conditions, we show how a corresponding learning algorithm can be constructed directly and methodically from the recursive algorithm. Our framework applies also to iterative algorithms by viewing them as a degenerate form of recursion. Extensive experimentation shows better performance for our proposal over classical and state-of-the-art approaches.
    Deep Sequence Modeling for Anomalous ISP Traffic Prediction. (arXiv:2205.01685v1 [cs.LG])
    Internet traffic in the real world is susceptible to various external and internal factors which may abruptly change the normal traffic flow. Those unexpected changes are considered outliers in traffic. Deep sequence models have been used to predict complex IP traffic, but their comparative performance on anomalous traffic has not been studied extensively. In this paper, we investigated and evaluated the performance of different deep sequence models for anomalous traffic prediction. Several deep sequence models were implemented to predict real traffic with and without outliers, showing the significance of outlier detection in real-world traffic prediction. First, two different outlier detection techniques, the Three-Sigma rule and Isolation Forest, were applied to identify the anomalies. Second, we adjusted those abnormal data points using the Backward Filling technique before training the model. Finally, the performance of the different models was compared for abnormal and adjusted traffic. LSTM_Encoder_Decoder (LSTM_En_De) is the best prediction model in our experiment, reducing the deviation between actual and predicted traffic by more than 11% after adjusting the outliers. All other models, namely Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), LSTM_En_De with Attention layer (LSTM_En_De_Atn), and Gated Recurrent Unit (GRU), also show better predictions after replacing the outliers, decreasing prediction error by more than 29%, 24%, 19%, and 10%, respectively. Our experimental results indicate that outliers in the data can significantly impact the quality of the prediction. Thus, outlier detection and mitigation assist the deep sequence model in learning the general trend and making better predictions.
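    The Three-Sigma adjustment step described above amounts to a few lines of pandas; a minimal sketch, assuming a univariate traffic series:

        import pandas as pd

        def adjust_outliers_three_sigma(traffic: pd.Series) -> pd.Series:
            # Flag points more than 3 standard deviations from the mean,
            # then replace them by Backward Filling.
            mu, sigma = traffic.mean(), traffic.std()
            outliers = (traffic - mu).abs() > 3 * sigma
            return traffic.mask(outliers).bfill()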
    Frequency Domain-Based Detection of Generated Audio. (arXiv:2205.01806v1 [cs.SD])
    Attackers may manipulate audio with the intent of presenting falsified reports, changing an opinion of a public figure, and winning influence and power. The prevalence of inauthentic multimedia continues to rise, so it is imperative to develop a set of tools that determines the legitimacy of media. We present a method that analyzes audio signals to determine whether they contain real human voices or fake human voices (i.e., voices generated by neural acoustic and waveform models). Instead of analyzing the audio signals directly, the proposed approach converts the audio signals into spectrogram images displaying frequency, intensity, and temporal content and evaluates them with a Convolutional Neural Network (CNN). Trained on both genuine human voice signals and synthesized voice signals, we show our approach achieves high accuracy on this classification task.
    Spatial-Temporal Meta-path Guided Explainable Crime Prediction. (arXiv:2205.01901v1 [cs.LG])
    Exposure to crime and violence can harm individuals' quality of life and the economic growth of communities. In light of the rapid development in machine learning, there is a rise in the need to explore automated solutions to prevent crimes. With the increasing availability of both fine-grained urban and public service data, there is a recent surge in fusing such cross-domain information to facilitate crime prediction. By capturing the information about social structure, environment, and crime trends, existing machine learning predictive models have explored the dynamic crime patterns from different views. However, these approaches mostly convert such multi-source knowledge into implicit and latent representations (e.g., learned embeddings of districts), making it still a challenge to investigate the impacts of explicit factors for the occurrences of crimes behind the scenes. In this paper, we present a Spatial-Temporal Metapath guided Explainable Crime prediction (STMEC) framework to capture dynamic patterns of crime behaviours and explicitly characterize how the environmental and social factors mutually interact to produce the forecasts. Extensive experiments show the superiority of STMEC compared with other advanced spatiotemporal models, especially in predicting felonies (e.g., robberies and assaults with dangerous weapons).
    ASTROMER: A transformer-based embedding for the representation of light curves. (arXiv:2205.01677v1 [astro-ph.IM])
    Taking inspiration from natural language embeddings, we present ASTROMER, a transformer-based model to create representations of light curves. ASTROMER was trained on millions of MACHO R-band samples and can be easily fine-tuned to match specific domains associated with downstream tasks. As an example, this paper shows the benefits of using pre-trained representations to classify variable stars. In addition, we provide a Python library including all the functionalities employed in this work. Our library includes the pre-trained models, which can be used to enhance the performance of deep learning models, decreasing the computational resources required while achieving state-of-the-art results.
    Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks. (arXiv:2205.01714v1 [cs.CL])
    Deep learning (DL) is being used extensively for text classification. However, researchers have demonstrated the vulnerability of such classifiers to adversarial attacks. Attackers modify the text in a way which misleads the classifier while keeping the original meaning close to intact. State-of-the-art (SOTA) attack algorithms follow the general principle of making minimal changes to the text so as to not jeopardize semantics. Taking advantage of this, we propose a novel and intuitive defense strategy called Sample Shielding. It is attacker and classifier agnostic, does not require any reconfiguration of the classifier or external resources, and is simple to implement. Essentially, we sample subsets of the input text, classify them, and summarize these into a final decision. We shield three popular DL text classifiers with Sample Shielding and test their resilience against four SOTA attackers across three datasets in a realistic threat setting. Even when given the advantage of knowing about our shielding strategy, the adversary's attack success rate is <=10% with only one exception, and often <5%. Additionally, Sample Shielding maintains near-original accuracy when applied to original texts. Crucially, we show that the 'make minimal changes' approach of SOTA attackers leads to critical vulnerabilities that can be defended against with an intuitive sampling strategy.
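    A minimal sketch of the shielding decision rule (the subset-sampling scheme and vote count are illustrative assumptions; the paper's exact sampling strategy may differ):

        import random

        def sample_shield_predict(classify, text, n_views=9, keep=0.5, seed=0):
            # classify: any black-box text classifier mapping a string to a label.
            rng = random.Random(seed)
            words = text.split()
            votes = []
            for _ in range(n_views):
                kept = [w for w in words if rng.random() < keep]  # random subset of the input
                votes.append(classify(" ".join(kept) or text))
            return max(set(votes), key=votes.count)               # majority vote over the views

    Because SOTA attacks perturb only a few tokens, most sampled subsets miss the perturbation, and the vote recovers the clean prediction.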
    The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts. (arXiv:2205.01780v1 [eess.AS])
    The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of 10 emotions conveyed by vocal bursts. This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies. The baselines are as follows: for ExVo-MultiTask, a combined score ($S_{MTL}$), computing the harmonic mean of Concordance Correlation Coefficient (CCC), Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE), is at best 0.335; for ExVo-Generate, we report Fréchet inception distance (FID) scores ranging from 4.81 to 8.27 (depending on the emotion) between the training set and generated samples, and combining the inverted FID with perceptual ratings of the generated samples ($S_{Gen}$) yields 0.174; and for ExVo-FewShot, a mean CCC of 0.444 is obtained.
    Learning Abstract and Transferable Representations for Planning. (arXiv:2205.02092v1 [cs.LG])
    We are concerned with the question of how an agent can acquire its own representations from sensory data. We restrict our focus to learning representations for long-term planning, a class of problems that state-of-the-art learning methods are unable to solve. We propose a framework for autonomously learning state abstractions of an agent's environment, given a set of skills. Importantly, these abstractions are task-independent, and so can be reused to solve new tasks. We demonstrate how an agent can use an existing set of options to acquire representations from ego- and object-centric observations. These abstractions can immediately be reused by the same agent in new environments. We show how to combine these portable representations with problem-specific ones to generate a sound description of a specific task that can be used for abstract planning. Finally, we show how to autonomously construct a multi-level hierarchy consisting of increasingly abstract representations. Since these hierarchies are transferable, higher-order concepts can be reused in new tasks, relieving the agent from relearning them and improving sample efficiency. Our results demonstrate that our approach allows an agent to transfer previous knowledge to new tasks, improving sample efficiency as the number of tasks increases.
    MemSE: Fast MSE Prediction for Noisy Memristor-Based DNN Accelerators. (arXiv:2205.01707v1 [cs.LG])
    Memristors enable the computation of matrix-vector multiplications (MVM) in memory and, therefore, show great potential for substantially increasing the energy efficiency of deep neural network (DNN) inference accelerators. However, computations in memristors suffer from hardware non-idealities and are subject to different sources of noise that may negatively impact system performance. In this work, we theoretically analyze the mean squared error of DNNs that use memristor crossbars to compute MVM. We take into account both the quantization noise, due to the necessity of reducing the DNN model size, and the programming noise, stemming from the variability during the programming of the memristance value. Simulations on pre-trained DNN models showcase the accuracy of the analytical prediction. Furthermore, the proposed method is almost two orders of magnitude faster than Monte Carlo simulation, thus making it possible to optimize the implementation parameters to achieve minimal error for a given power constraint.
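    For contrast with the analytical prediction, a naive Monte Carlo baseline for the MVM error can be sketched as follows (the quantization and programming-noise models here are simplified placeholders, not the paper's noise model):

        import torch

        def mc_mvm_mse(W, x, n_trials=1000, q_bits=4, prog_sigma=0.05):
            # W: (m, n) weight matrix mapped to a crossbar; x: (n,) input vector.
            y_true = W @ x
            scale = W.abs().max() / (2 ** (q_bits - 1) - 1)
            err = 0.0
            for _ in range(n_trials):
                Wq = torch.round(W / scale) * scale                     # quantization
                Wn = Wq + prog_sigma * Wq.abs() * torch.randn_like(Wq)  # programming variability
                err += ((Wn @ x - y_true) ** 2).mean().item()
            return err / n_trials

    Each trial requires a full re-simulation of the noisy crossbar, which is precisely the per-configuration cost the analytical MSE prediction avoids.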
    FastMapSVM: Classifying Complex Objects Using the FastMap Algorithm and Support-Vector Machines. (arXiv:2204.05112v2 [cs.CV] UPDATED)
    Neural Networks and related Deep Learning methods are currently at the leading edge of technologies used for classifying objects. However, they generally demand large amounts of time and data for model training; and their learned models can sometimes be difficult to interpret. In this paper, we re-introduce FastMapSVM, an interpretable Machine Learning framework for classifying complex objects. FastMapSVM combines the strengths of FastMap and Support-Vector Machines. FastMap is an efficient linear-time algorithm that maps complex objects to points in a Euclidean space, while preserving pairwise non-Euclidean distances between them. We demonstrate the efficiency and effectiveness of FastMapSVM in the context of classifying seismograms. We show that its performance, in terms of precision, recall, and accuracy, is comparable to that of other state-of-the-art methods. However, compared to other methods, FastMapSVM uses significantly smaller amounts of time and data for model training. It also provides a perspicuous visualization of the objects and the classification boundaries between them. We expect FastMapSVM to be viable for classification tasks in many other real-world domains.
    SVTS: Scalable Video-to-Speech Synthesis. (arXiv:2205.02058v1 [cs.SD])
    Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: To the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
    Can Rationalization Improve Robustness?. (arXiv:2204.11790v2 [cs.CL] UPDATED)
    A growing line of work has investigated the development of neural NLP models that can produce rationales--subsets of the input that can explain their model predictions. In this paper, we ask whether such rationale models can also provide robustness to adversarial attacks in addition to their interpretable nature. Since these models need to first generate rationales ("rationalizer") before making predictions ("predictor"), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of 'AddText' attacks for both token and sentence-level rationalization tasks, and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that rationale models show promise in improving robustness, while they struggle in certain scenarios--for instance, when the rationalizer is sensitive to positional bias or the lexical choices of the attack text. Further, leveraging human rationales as supervision does not always translate to better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework.
    Probabilistic Symmetry for Improved Trajectory Forecasting. (arXiv:2205.01927v1 [cs.LG])
    Trajectory prediction is a core AI problem with broad applications in robotics and autonomous driving. While most existing works focus on deterministic prediction, producing probabilistic forecasts to quantify prediction uncertainty is critical for downstream decision-making tasks such as risk assessment, motion planning, and safety guarantees. We introduce a new metric, mean regional score (MRS), to evaluate the quality of probabilistic trajectory forecasts. We propose a novel probabilistic trajectory prediction model, Probabilistic Equivariant Continuous COnvolution (PECCO) and show that leveraging symmetry, specifically rotation equivariance, can improve the predictions' accuracy as well as coverage. On both vehicle and pedestrian datasets, PECCO shows state-of-the-art prediction performance and improved calibration compared to baselines.
    A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting. (arXiv:2201.00008v3 [cs.LG] UPDATED)
    We study the forecasting problem for traffic with dynamic, possibly periodical, and joint spatial-temporal dependency between regions. Given the aggregated inflow and outflow traffic of regions in a city from time slots 0 to t-1, we predict the traffic at time t at any region. Prior work in the area often considers the spatial and temporal dependencies in a decoupled manner, or is rather computationally intensive to train, with a large number of hyper-parameters to tune. We propose ST-TIS, a novel, lightweight, and accurate Spatial-Temporal Transformer with information fusion and region sampling for traffic forecasting. ST-TIS extends the canonical Transformer with information fusion and region sampling. The information fusion module captures the complex spatial-temporal dependency between regions. The region sampling module improves efficiency and prediction accuracy, cutting the computation complexity for dependency learning from $O(n^2)$ to $O(n\sqrt{n})$, where n is the number of regions. With far fewer parameters than state-of-the-art models, the offline training of our model is significantly faster in terms of tuning and computation (with a reduction of up to $90\%$ in training time and network parameters). Notwithstanding such training efficiency, extensive experiments show that ST-TIS is substantially more accurate in online prediction than state-of-the-art approaches (with an average improvement of up to $9.5\%$ on RMSE and $12.4\%$ on MAPE).
    Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction. (arXiv:2203.11444v3 [cs.LG] UPDATED)
    Chemical reaction prediction, involving forward synthesis and retrosynthesis prediction, is a fundamental problem in organic synthesis. A popular computational paradigm formulates synthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES is adopted for molecule representations. However, the general-purpose SMILES neglects the characteristics of chemical reactions, where the molecular graph topology is largely unaltered from reactants to products, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES for more efficient synthesis prediction. Due to the strict one-to-one mapping and reduced edit distance, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for reactions. We compare the proposed R-SMILES with various state-of-the-art baselines and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.
    Optimizing Mixture of Experts using Dynamic Recompilations. (arXiv:2205.01848v1 [cs.LG])
    The Mixture of Experts architecture allows for outrageously large neural networks by scaling model parameter size independently from computational demand (FLOPs). However, current DNN frameworks cannot effectively support the dynamic data flow in Mixture of Experts, and implementations on top of these frameworks need to use workarounds that introduce significant overheads. To address the limitation of these frameworks, we present DynaMoE, a DNN library that uses dynamic recompilations to optimize and adapt the use of computational resources to the dynamic needs of Mixture of Experts models. Our evaluation shows that DynaMoE achieves a 1.8x speedup and supports 2.3x larger model sizes when compared to existing MoE systems, even when not using recompilations. We then present further optimizations enabled by dynamic recompilations that yield an additional 1.7x speedup while simultaneously reducing memory pressure and improving model quality.
    DIAS: A Domain-Independent Alife-Based Problem-Solving System. (arXiv:2203.06855v2 [cs.NE] UPDATED)
    A domain-independent problem-solving system based on principles of Artificial Life is introduced. In this system, DIAS, the input and output dimensions of the domain are laid out in a spatial medium. A population of actors, each seeing only part of this medium, solves problems collectively in it. The process is independent of the domain and can be implemented through different kinds of actors. Through a set of experiments on various problem domains, DIAS is shown able to solve problems with different dimensionality and complexity, to require no hyperparameter tuning for new problems, and to exhibit lifelong learning, i.e. adapt rapidly to run-time changes in the problem domain, and do it better than a standard non-collective approach. DIAS therefore demonstrates a role for Alife in building scalable, general, and adaptive problem-solving systems.
    Word Tour: One-dimensional Word Embeddings via the Traveling Salesman Problem. (arXiv:2205.01954v1 [cs.CL])
    Word embeddings are one of the most fundamental technologies used in natural language processing. Existing word embeddings are high-dimensional and consume considerable computational resources. In this study, we propose WordTour, unsupervised one-dimensional word embeddings. To achieve this challenging goal, we propose a decomposition of the desiderata of word embeddings into two parts, completeness and soundness, and focus on soundness in this paper. Owing to its single dimensionality, WordTour is extremely efficient and provides a minimal means of handling word embeddings. We experimentally confirmed the effectiveness of the proposed method via a user study and document classification.
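    The reduction from embeddings to a one-dimensional ordering can be illustrated with a nearest-neighbor tour (WordTour itself uses a proper TSP solver; this greedy pass is only a sketch of the construction):

        import numpy as np

        def word_tour_greedy(embeddings, start=0):
            # embeddings: (n, d) word vectors; returns a visiting order of all n words.
            n = len(embeddings)
            unvisited, tour = set(range(n)) - {start}, [start]
            while unvisited:
                last = embeddings[tour[-1]]
                nxt = min(unvisited, key=lambda i: float(np.linalg.norm(embeddings[i] - last)))
                tour.append(nxt)
                unvisited.remove(nxt)
            return tour   # a word's position in the tour serves as its one-dimensional embedding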
    VICE: Variational Inference for Concept Embeddings. (arXiv:2205.00756v3 [cs.LG] UPDATED)
    In this paper, we introduce Variational Inference for Concept Embeddings (VICE), an approximate Bayesian method for learning object concept embeddings from human behavior in an odd-one-out triplet task. We use variational inference to obtain a sparse, non-negative solution with uncertainty estimates about each embedding value. We exploit these estimates to automatically select the dimensions that explain the data while yielding reproducible embeddings. We introduce a PAC learning bound for VICE that can be used to estimate generalization performance or determine a sufficient sample size for different experimental designs. VICE rivals or outperforms its predecessor, SPoSE, at predicting human behavior in a triplet task. VICE object representations are substantially more reproducible and consistent across different random initializations.
    The scope for AI-augmented interpretation of building blueprints in commercial and industrial property insurance. (arXiv:2205.01671v1 [cs.CV])
    This report, commissioned by the WTW research network, investigates the use of AI in property risk assessment. It (i) reviews existing work on risk assessment in commercial and industrial properties and on automated information extraction from building blueprints; and (ii) presents an exploratory 'proof-of-concept' solution exploring the feasibility of using machine learning for the automated extraction of information from building blueprints to support insurance risk assessment.
    Finding patterns in Knowledge Attribution for Transformers. (arXiv:2205.01366v2 [cs.CL] UPDATED)
    We analyze the Knowledge Neurons framework for the attribution of factual and relational knowledge to particular neurons in the transformer network. We use a 12-layer multi-lingual BERT model for our experiments. Our study reveals various interesting phenomena. We observe that factual knowledge can mostly be attributed to the middle and higher layers of the network ($\ge 6$). Further analysis reveals that the middle layers ($6-9$) are mostly responsible for relational information, which is further refined into actual factual knowledge or the "correct answer" in the last few layers ($10-12$). Our experiments also show that the model handles prompts that are in different languages but represent the same fact similarly, providing further evidence for the effectiveness of multi-lingual pre-training. Applying the attribution scheme to grammatical knowledge, we find that grammatical knowledge is far more dispersed among the neurons than factual knowledge.
    Cross-Loss Influence Functions to Explain Deep Network Representations. (arXiv:2012.01685v2 [cs.LG] UPDATED)
    As machine learning is increasingly deployed in the real world, it is paramount that we develop the tools necessary to analyze the decision-making of the models we train and deploy to end-users. Recently, researchers have shown that influence functions, a statistical measure of sample impact, can approximate the effects of training samples on classification accuracy for deep neural networks. However, this prior work only applies to supervised learning, where training and testing share an objective function. No approaches currently exist for estimating the influence of unsupervised training examples for deep learning models. To bring explainability to unsupervised and semi-supervised training regimes, we derive the first theoretical and empirical demonstration that influence functions can be extended to handle mismatched training and testing (i.e., "cross-loss") settings. Our formulation enables us to compute the influence in an unsupervised learning setup, explain cluster memberships, and identify and augment biases in language models. Our experiments show that our cross-loss influence estimates even exceed matched-objective influence estimation relative to ground-truth sample impact.
    Processing Network Controls via Deep Reinforcement Learning. (arXiv:2205.02119v1 [math.OC])
    Novel advanced policy gradient (APG) algorithms, such as proximal policy optimization (PPO), trust region policy optimization, and their variations, have become the dominant reinforcement learning (RL) algorithms because of their ease of implementation and good practical performance. This dissertation is concerned with the theoretical justification and practical application of APG algorithms for solving processing network control optimization problems. Processing network control problems are typically formulated as Markov decision process (MDP) or semi-Markov decision process (SMDP) problems that have several features that are unconventional for RL: infinite state spaces, unbounded costs, and long-run average cost objectives. Policy improvement bounds play a crucial role in the theoretical justification of APG algorithms. In this thesis we refine existing bounds for MDPs with finite state spaces and prove novel policy improvement bounds for classes of MDPs and SMDPs used to model processing network operations. We consider two examples of processing network control problems and customize the PPO algorithm to solve them. First, we consider parallel-server and multiclass queueing network controls. Second, we consider the driver repositioning problem in a ride-hailing service system. For both examples the PPO algorithm with auxiliary modifications consistently generates control policies that outperform state-of-the-art heuristics.
    Accelerating Inhibitor Discovery for Multiple SARS-CoV-2 Targets with a Single, Sequence-Guided Deep Generative Framework. (arXiv:2204.09042v2 [q-bio.QM] UPDATED)
    The COVID-19 pandemic has highlighted the urgency for developing more efficient molecular discovery pathways. As exhaustive exploration of the vast chemical space is infeasible, discovering novel inhibitor molecules for emerging drug-target proteins is challenging, particularly for targets with unknown structure or ligands. We demonstrate the broad utility of a single deep generative framework toward discovering novel drug-like inhibitor molecules against two distinct SARS-CoV-2 targets -- the main protease (Mpro) and the receptor binding domain (RBD) of the spike protein. To perform target-aware design, the framework employs a target sequence-conditioned sampling of novel molecules from a generative model. Micromolar-level in vitro inhibition was observed for two candidates (out of four synthesized) for each target. The most potent spike RBD inhibitor also emerged as a rare non-covalent antiviral with broad-spectrum activity against several SARS-CoV-2 variants in live virus neutralization assays. These results show that a broadly deployable machine intelligence framework can accelerate hit discovery across different emerging drug-targets.
    An improved central limit theorem and fast convergence rates for entropic transportation costs. (arXiv:2204.09105v2 [math.ST] UPDATED)
    We prove a central limit theorem for the entropic transportation cost between subgaussian probability measures, centered at the population cost. This is the first result which allows for asymptotically valid inference for entropic optimal transport between measures which are not necessarily discrete. In the compactly supported case, we complement these results with new, faster, convergence rates for the expected entropic transportation cost between empirical measures. Our proof is based on strengthening convergence results for dual solutions to the entropic optimal transport problem.
    Prediction of fish location by combining fisheries data and sea bottom temperature forecasting. (arXiv:2205.02107v1 [cs.CV])
    This paper combines fisheries-dependent data and environmental data in a machine learning pipeline to predict the spatio-temporal abundance of two species (plaice and sole) commonly caught by the Belgian fishery in the North Sea. By combining fisheries-related features with environmental data, namely sea bottom temperature derived from remote sensing, a higher accuracy can be achieved. In a forecast setting, the predictive accuracy is further improved by predicting the sea bottom temperature up to four days in advance with a recurrent deep neural network, instead of relying on the last available temperature measurement.
    Predicting vacant parking space availability zone-wisely: a graph based spatio-temporal prediction approach. (arXiv:2205.02113v1 [cs.LG])
    Vacant parking space (VPS) prediction is one of the key issues of intelligent parking guidance systems. Accurately predicting VPS information plays a crucial role in such systems, helping drivers find parking spaces quickly and reducing unnecessary waste of time and excessive environmental pollution. Through a simple analysis of historical data, we found that there not only exists an obvious temporal correlation in each parking lot, but also a clear spatial correlation between different parking lots. In view of this, this paper proposes a graph data-based model, ST-GBGRU (Spatial-Temporal Graph Based Gated Recurrent Unit), with which the number of VPSs can be predicted both in the short term (i.e., within 30 min) and in the long term (i.e., over 30 min). On the one hand, the temporal correlation of historical VPS data is extracted by a GRU; on the other hand, the spatial correlation of historical VPS data is extracted by a GCN inside the GRU. Two prediction methods, namely direct prediction and iterative prediction, are combined with the proposed model. Finally, the prediction model is applied to predict the number of VPSs in 8 public parking lots in Santa Monica. The results show that in both short-term and long-term prediction tasks, the ST-GBGRU model achieves high accuracy and has good application prospects.
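    The "GCN inside the GRU" construction can be sketched as a GRU cell whose gate transformations first aggregate over the parking-lot graph; the following PyTorch cell is a hedged sketch (the gate layout and normalization are generic choices, not necessarily ST-GBGRU's exact design):

        import torch
        import torch.nn as nn

        class GraphGRUCell(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.gz = nn.Linear(2 * dim, dim)   # update gate
                self.gr = nn.Linear(2 * dim, dim)   # reset gate
                self.gh = nn.Linear(2 * dim, dim)   # candidate state

            def forward(self, x, h, A_hat):
                # x, h: (n_lots, dim); A_hat: (n_lots, n_lots) normalized adjacency.
                def gconv(lin, a, b):
                    # Graph convolution: neighborhood aggregation via A_hat, then a linear map.
                    return lin(A_hat @ torch.cat([a, b], dim=-1))
                z = torch.sigmoid(gconv(self.gz, x, h))
                r = torch.sigmoid(gconv(self.gr, x, h))
                h_tilde = torch.tanh(gconv(self.gh, x, r * h))
                return (1 - z) * h + z * h_tilde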
    Domino Saliency Metrics: Improving Existing Channel Saliency Metrics with Structural Information. (arXiv:2205.02131v1 [cs.CV])
    Channel pruning is used to reduce the number of weights in a Convolutional Neural Network (CNN). Channel pruning removes slices of the weight tensor so that the convolution layer remains dense. The removal of these weight slices from a single layer causes a mismatch in the number of feature maps between layers of the network. A simple solution is to force the numbers of feature maps between layers to match through the removal of weight slices from subsequent layers. This additional constraint becomes more apparent in DNNs with branches, where multiple channels need to be pruned together to keep the network dense. Popular pruning saliency metrics do not factor in the structural dependencies that arise in DNNs with branches. We propose Domino metrics (built on existing channel saliency metrics) to reflect these structural constraints. We test Domino saliency metrics against the baseline channel saliency metrics on multiple networks with branches. Domino saliency metrics improved pruning rates in most tested networks and by up to 25% in AlexNet on CIFAR-10.
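    A hedged sketch of the core idea: compute a base channel saliency per layer, then combine it across the layers whose channels are structurally coupled (the L1 base metric and the summation rule are one simple instantiation, not the only Domino variant):

        import torch

        def domino_channel_saliency(coupled_weights):
            # coupled_weights: list of weight tensors whose output channels must be
            # pruned together (e.g. the layers feeding both sides of a branch join);
            # all tensors are assumed to share the same number of output channels.
            per_layer = [w.abs().sum(dim=tuple(range(1, w.dim()))) for w in coupled_weights]
            return torch.stack(per_layer).sum(dim=0)   # joint saliency per coupled channel

    Channels with the lowest joint saliency are then pruned simultaneously from every coupled layer, keeping the network dense.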
    A Game-Theoretic Approach for Improving Generalization Ability of TSP Solvers. (arXiv:2110.15105v3 [cs.LG] UPDATED)
    In this paper, we introduce a two-player zero-sum framework between a trainable \emph{Solver} and a \emph{Data Generator} to improve the generalization ability of deep learning-based solvers for the Traveling Salesman Problem (TSP). Grounded in \textsl{Policy Space Response Oracle} (PSRO) methods, our two-player framework outputs a population of best-responding Solvers, over which we can mix and output a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances with different types and sizes. Results suggest that our Solvers achieve state-of-the-art performance even on tasks the Solver has never met, whilst the performance of other deep learning-based Solvers drops sharply due to over-fitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed two-player game and demonstrate that the exploitability of the Solver population decreases during training, eventually approximating the Nash equilibrium along with the Generator.
    Modelling calibration uncertainty in networks of environmental sensors. (arXiv:2205.01988v1 [cs.LG])
    Networks of low-cost sensors are becoming ubiquitous, but often suffer from low accuracy and drift. Regular colocation with reference sensors allows recalibration but is often complicated and expensive. Alternatively, the calibration can be transferred using low-cost, mobile sensors, often at very low cost. However, inferring appropriate estimates of the calibration functions (with uncertainty) for the network of sensors becomes difficult, especially as the network of visits by the mobile, low-cost sensors becomes large. We propose a variational approach to model the calibration across the network of sensors. We demonstrate the approach on both synthetic and real air pollution data, and find it can perform better than the state of the art (multi-hop calibration). We extend it to categorical data, combining classifications of insects by non-expert citizen scientists. Achieving uncertainty-quantified calibration has been one of the major barriers to low-cost sensor deployment and citizen-science research. We hope that the methods described will enable such projects.
    On Circuit Depth Scaling For Quantum Approximate Optimization. (arXiv:2205.01698v1 [quant-ph])
    Variational quantum algorithms are the centerpiece of modern quantum programming. These algorithms involve training parameterized quantum circuits using a classical co-processor, an approach adapted partly from classical machine learning. An important subclass of these algorithms, designed for combinatorial optimization on current quantum hardware, is the quantum approximate optimization algorithm (QAOA). It is known that problem density - the ratio of problem constraints to variables - induces under-parametrization in fixed-depth QAOA. Density-dependent performance has been reported in the literature, yet the circuit depth required to achieve fixed performance (henceforth called the critical depth) remained unknown. Here, we propose a predictive model, based on a logistic saturation conjecture, for critical depth scaling with respect to density. Focusing on random instances of MAX-2-SAT, we test our predictive model against simulated data with up to 15 qubits. We report that the average critical depth, required to attain a success probability of 0.7, saturates at a value of 10 for densities beyond 4. We observe that the predictive model describes the simulated data within a $3\sigma$ confidence interval. Furthermore, based on the model, a linear trend for the critical depth with respect to problem size is recovered for the range of 5 to 15 qubits.
    EmoBank: Studying the Impact of Annotation Perspective and Representation Format on Dimensional Emotion Analysis. (arXiv:2205.01996v1 [cs.CL])
    We describe EmoBank, a corpus of 10k English sentences balancing multiple genres, which we annotated with dimensional emotion metadata in the Valence-Arousal-Dominance (VAD) representation format. EmoBank excels with a bi-perspectival and bi-representational design. On the one hand, we distinguish between writer's and reader's emotions, on the other hand, a subset of the corpus complements dimensional VAD annotations with categorical ones based on Basic Emotions. We find evidence for the supremacy of the reader's perspective in terms of IAA and rating intensity, and achieve close-to-human performance when mapping between dimensional and categorical formats.
    A Manifold Two-Sample Test Study: Integral Probability Metric with Neural Networks. (arXiv:2205.02043v1 [stat.ML])
    Two-sample testing is an important area aiming to determine whether two collections of observations follow the same distribution or not. We propose two-sample tests based on the integral probability metric (IPM) for high-dimensional samples supported on a low-dimensional manifold. We characterize the properties of the proposed tests with respect to the number of samples $n$ and the structure of the manifold with intrinsic dimension $d$. When an atlas is given, we propose a two-step test to identify the difference between general distributions, which achieves the type-II risk in the order of $n^{-1/\max\{d,2\}}$. When an atlas is not given, we propose the Hölder IPM test, which applies to data distributions with $(s,\beta)$-Hölder densities and achieves the type-II risk in the order of $n^{-(s+\beta)/d}$. To mitigate the heavy computation burden of evaluating the Hölder IPM, we approximate the Hölder function class using neural networks. Based on the approximation theory of neural networks, we show that the neural network IPM test has the type-II risk in the order of $n^{-(s+\beta)/d}$, which is of the same order as the type-II risk of the Hölder IPM test. Our proposed tests are adaptive to low-dimensional geometric structure because their performance crucially depends on the intrinsic dimension instead of the data dimension.
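    In the spirit of the neural network IPM test, the test statistic can be sketched as an inner maximization over a small network class (weight decay stands in here for the Hölder-type norm control, and calibration of the rejection threshold, e.g. by permutation, is omitted):

        import torch
        import torch.nn as nn

        def nn_ipm_statistic(x, y, width=64, steps=200, lr=1e-2):
            # x: (n1, d) and y: (n2, d) samples; returns max_f mean f(x) - mean f(y)
            # over an (implicitly regularized) one-hidden-layer network class.
            f = nn.Sequential(nn.Linear(x.shape[1], width), nn.ReLU(), nn.Linear(width, 1))
            opt = torch.optim.Adam(f.parameters(), lr=lr, weight_decay=1e-3)
            for _ in range(steps):
                stat = f(x).mean() - f(y).mean()
                opt.zero_grad()
                (-stat).backward()   # ascend the mean discrepancy
                opt.step()
            with torch.no_grad():
                return (f(x).mean() - f(y).mean()).item()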
    FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization. (arXiv:2205.02215v1 [cs.LG])
    Standard federated optimization methods successfully apply to stochastic problems with single-level structure. However, many contemporary ML problems -- including adversarial robustness, hyperparameter tuning, and actor-critic -- fall under nested bilevel programming that subsumes minimax and compositional optimization. In this work, we propose FedNest: a federated alternating stochastic gradient method to address general nested problems. We establish provable convergence rates for FedNest in the presence of heterogeneous data and introduce variations for bilevel, minimax, and compositional optimization. FedNest introduces multiple innovations, including federated hypergradient computation and variance reduction, to address inner-level heterogeneity. We complement our theory with experiments on hyperparameter and hyper-representation learning and minimax optimization that demonstrate the benefits of our method in practice. Code is available at https://github.com/mc-nya/FedNest.
    Generalized Multi-Output Gaussian Process Censored Regression. (arXiv:2009.04822v2 [stat.ML] UPDATED)
    When modelling censored observations, a typical approach in current regression methods is to use a censored-Gaussian (i.e. Tobit) model to describe the conditional output distribution. In this paper, as in the case of missing data, we argue that exploiting correlations between multiple outputs can enable models to better address the bias introduced by censored data. To do so, we introduce a heteroscedastic multi-output Gaussian process model which combines the non-parametric flexibility of GPs with the ability to leverage information from correlated outputs under input-dependent noise conditions. To address the resulting inference intractability, we further devise a variational bound to the marginal log-likelihood suitable for stochastic optimization. We empirically evaluate our model against other generative models for censored data on both synthetic and real world tasks and further show how it can be generalized to deal with arbitrary likelihood functions. Results show how the added flexibility allows our model to better estimate the underlying non-censored (i.e. true) process under potentially complex censoring dynamics.
    Balsa: Learning a Query Optimizer Without Expert Demonstrations. (arXiv:2201.01441v2 [cs.DB] UPDATED)
    Query optimizers are a performance-critical component in every database system. Due to their complexity, optimizers take experts months to write and years to refine. In this work, we demonstrate for the first time that learning to optimize queries without learning from an expert optimizer is both possible and efficient. We present Balsa, a query optimizer built by deep reinforcement learning. Balsa first learns basic knowledge from a simple, environment-agnostic simulator, followed by safe learning in real execution. On the Join Order Benchmark, Balsa matches the performance of two expert query optimizers, both open-source and commercial, with two hours of learning, and outperforms them by up to 2.8$\times$ in workload runtime after a few more hours. Balsa thus opens the possibility of automatically learning to optimize in future compute environments where expert-designed optimizers do not exist.
    Policy Optimization Using Semi-parametric Models for Dynamic Pricing. (arXiv:2109.06368v2 [cs.LG] UPDATED)
    In this paper, we study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise. Products are sold one at a time, and only a binary response indicating success or failure of a sale is observed. Our model setting is similar to Javanmard and Nazerzadeh [2019] except that we expand the demand curve to a semiparametric model and need to learn dynamically both parametric and nonparametric components. We propose a dynamic statistical learning and decision-making policy that combines semiparametric estimation from a generalized linear model with an unknown link and online decision-making to minimize regret (maximize revenue). Under mild conditions, we show that for a market noise c.d.f. $F(\cdot)$ with $m$-th order derivative ($m\geq 2$), our policy achieves a regret upper bound of $\tilde{O}_{d}(T^{\frac{2m+1}{4m-1}})$, where $T$ is time horizon and $\tilde{O}_{d}$ is the order that hides logarithmic terms and the dimensionality of feature $d$. The upper bound is further reduced to $\tilde{O}_{d}(\sqrt{T})$ if $F$ is super smooth whose Fourier transform decays exponentially. In terms of dependence on the horizon $T$, these upper bounds are close to $\Omega(\sqrt{T})$, the lower bound where $F$ belongs to a parametric class. We further generalize these results to the case with dynamically dependent product features under the strong mixing condition.
    MAD: Self-Supervised Masked Anomaly Detection Task for Multivariate Time Series. (arXiv:2205.02100v1 [cs.LG])
    In this paper, we introduce Masked Anomaly Detection (MAD), a general self-supervised learning task for multivariate time series anomaly detection. With the increasing availability of sensor data from industrial systems, being able to detect anomalies from streams of multivariate time series data is of significant importance. Given the scarcity of anomalies in real-world applications, the majority of the literature has focused on modeling normality. The learned normal representations can empower anomaly detection as the model has learned to capture certain key underlying data regularities. A typical formulation is to learn a predictive model, i.e., use a window of time series data to predict future data values. In this paper, we propose an alternative self-supervised learning task. By randomly masking a portion of the inputs and training a model to estimate them using the remaining ones, MAD is an improvement over the traditional left-to-right next step prediction (NSP) task. Our experimental results demonstrate that MAD can achieve better anomaly detection rates than traditional NSP approaches when using exactly the same neural network (NN) base models, and can be modified to run as fast as NSP models during test time on the same hardware, thus making it an ideal upgrade for many existing NSP-based NN anomaly detection models.
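    The masking objective is easy to state in code. Below is a minimal sketch -- the MLP, window size, and mask fraction are placeholders, not the paper's base model; the point is that, unlike NSP's next-step loss, the loss is computed only on the entries hidden from the model:

        import torch
        import torch.nn as nn

        window, n_vars, mask_frac = 50, 8, 0.25    # illustrative sizes
        model = nn.Sequential(                     # stand-in base model
            nn.Linear(2 * window * n_vars, 256), nn.ReLU(),
            nn.Linear(256, window * n_vars),
        )

        def mad_loss(batch):                       # batch: (B, window, n_vars)
            # Hide a random subset of entries; the mask itself is appended so
            # the model knows which values were removed.
            mask = (torch.rand_like(batch) < mask_frac).float()
            inp = torch.cat([batch * (1 - mask), mask], dim=-1).flatten(1)
            recon = model(inp).view_as(batch)
            # Score the model only on the entries it did not see.
            return ((recon - batch) ** 2 * mask).sum() / mask.sum().clamp(min=1)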
    Accelerating phase-field-based simulation via machine learning. (arXiv:2205.02121v1 [cond-mat.mtrl-sci])
    Phase-field-based models have become common in material science, mechanics, physics, biology, chemistry, and engineering for the simulation of microstructure evolution. Yet, they suffer from the drawback of being computationally very costly when applied to large, complex systems. To reduce such computational costs, a Unet-based artificial neural network is developed as a surrogate model in the current work. Training input for this network is obtained from the results of the numerical solution of initial-boundary-value problems (IBVPs) based on the Fan-Chen model for grain microstructure evolution. In particular, about 250 different simulations with varying initial order parameters are carried out and 200 frames of the time evolution of the phase fields are stored for each simulation. The network is trained with 90% of this data, taking the $i$-th frame of a simulation, i.e. order parameter field, as input, and producing the $(i+1)$-th frame as the output. Evaluation of the network is carried out with a test dataset consisting of 2200 microstructures based on different configurations than originally used for training. The trained network is applied recursively on initial order parameters to calculate the time evolution of the phase fields. The results are compared to the ones obtained from the conventional numerical solution in terms of the errors in order parameters and the system's free energy. The resulting order parameter error averaged over all points and all simulation cases is 0.005 and the relative error in the total free energy in all simulation boxes does not exceed 1%.
    Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis. (arXiv:2205.01800v1 [cs.SD])
    Synthesized speech is common today due to the prevalence of virtual assistants, easy-to-use tools for generating and modifying speech signals, and remote work practices. Synthesized speech can also be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal. We need methods to detect if a speech signal is synthesized. In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer (CCT) for synthesized speech detection. A CCT utilizes a convolutional layer that introduces inductive biases and shared weights into a network, allowing a transformer architecture to perform well with fewer data samples used for training. The CCT uses an attention mechanism to incorporate information from all parts of a signal under analysis. Trained on both genuine human voice signals and synthesized human voice signals, we demonstrate that our CCT approach successfully differentiates between genuine and synthesized speech signals.
    Uncertainty-Autoencoder-Based Privacy and Utility Preserving Data Type Conscious Transformation. (arXiv:2205.01950v1 [cs.LG])
    We propose an adversarial learning framework that deals with the privacy-utility tradeoff problem under two types of conditions: data-type ignorant, and data-type aware. Under data-type aware conditions, the privacy mechanism provides a one-hot encoding of categorical features, representing exactly one class, while under data-type ignorant conditions the categorical variables are represented by a collection of scores, one for each class. We use a neural network architecture consisting of a generator and a discriminator, where the generator consists of an encoder-decoder pair, and the discriminator consists of an adversary and a utility provider. Unlike previous research considering this kind of architecture, which leverages autoencoders (AEs) without introducing any randomness, or variational autoencoders (VAEs) based on learning latent representations which are then forced into a Gaussian assumption, our proposed technique introduces randomness and removes the Gaussian assumption restriction on the latent variables, only focusing on the end-to-end stochastic mapping of the input to privatized data. We test our framework on different datasets: MNIST, FashionMNIST, UCI Adult, and US Census Demographic Data, providing a wide range of possible private and utility attributes. We use multiple adversaries simultaneously to test our privacy mechanism -- some trained from the ground truth data and some trained from the perturbed data generated by our privacy mechanism. Through comparative analysis, our results demonstrate better privacy and utility guarantees than the existing works under similar, data-type ignorant conditions, even when the latter are considered under their original restrictive single-adversary model.
    Do More Negative Samples Necessarily Hurt in Contrastive Learning?. (arXiv:2205.01789v1 [cs.LG])
    Recent investigations in noise contrastive estimation suggest, both empirically and theoretically, that while having more "negative samples" in the contrastive loss improves downstream classification performance initially, beyond a threshold it hurts downstream performance due to a "collision-coverage" trade-off. But is such a phenomenon inherent in contrastive learning? We show in a simple theoretical setting, where positive pairs are generated by sampling from the underlying latent class (introduced by Saunshi et al. (ICML 2019)), that the downstream performance of the representation optimizing the (population) contrastive loss in fact does not degrade with the number of negative samples. Along the way, we give a structural characterization of the optimal representation in our framework, for noise contrastive estimation. We also provide empirical support for our theoretical results on the CIFAR-10 and CIFAR-100 datasets.
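    A compact version of the contrastive loss under study, where the number of negatives $K$ is the quantity being varied -- a standard InfoNCE-style implementation, not code from the paper:

        import torch
        import torch.nn.functional as F

        def contrastive_loss(anchor, positive, negatives, temperature=0.5):
            # anchor, positive: (B, D); negatives: (B, K, D). Increasing K adds
            # terms to the denominator; that K is the knob the paper studies.
            a = F.normalize(anchor, dim=-1)
            p = F.normalize(positive, dim=-1)
            n = F.normalize(negatives, dim=-1)
            pos = (a * p).sum(-1, keepdim=True) / temperature      # (B, 1)
            neg = torch.einsum("bd,bkd->bk", a, n) / temperature   # (B, K)
            logits = torch.cat([pos, neg], dim=1)
            # The positive similarity sits at index 0 of each row of logits.
            return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))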
    Generalized Knowledge Distillation via Relationship Matching. (arXiv:2205.01915v1 [cs.CV])
    The knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks. Knowledge distillation extracts knowledge from the teacher and integrates it with the target model (a.k.a. the "student"), which expands the student's knowledge and improves its learning efficacy. Instead of requiring the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained from a general label space -- in this "Generalized Knowledge Distillation (GKD)", the classes of the teacher and the student may be the same, completely different, or partially overlapped. We claim that the comparison ability between instances acts as an essential factor threading knowledge across tasks, and propose the RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach, which decouples the GKD flow of the embedding and the top-layer classifier. In particular, different from reconciling the instance-label confidence between models, REFILLED requires the teacher to reweight the hard tuples pushed forward by the student and then matches the similarity comparison levels between instances. An embedding-induced classifier based on the teacher model supervises the student's classification confidence and adaptively emphasizes the most related supervision from the teacher. REFILLED demonstrates strong discriminative ability when the classes of the teacher vary from the same to a fully non-overlapped set w.r.t. the student. It also achieves state-of-the-art performance on standard knowledge distillation, one-step incremental learning, and few-shot learning tasks.
    Zero-Episode Few-Shot Contrastive Predictive Coding: Solving intelligence tests without prior training. (arXiv:2205.01924v1 [cs.CV])
    Video prediction models often combine three components: an encoder from pixel space to a small latent space, a latent space prediction model, and a generative model back to pixel space. However, the large and unpredictable pixel space makes training such models difficult, requiring many training examples. We argue that finding a predictive latent variable and using it to evaluate the consistency of a future image enables data-efficient predictions because it precludes the necessity of a generative model training. To demonstrate it, we created sequence completion intelligence tests in which the task is to identify a predictably changing feature in a sequence of images and use this prediction to select the subsequent image. We show that a one-dimensional Markov Contrastive Predictive Coding (M-CPC_1D) model solves these tests efficiently, with only five examples. Finally, we demonstrate the usefulness of M-CPC_1D in solving two tasks without prior training: anomaly detection and stochastic movement video prediction.
    A Deep Learning-based Integrated Framework for Quality-aware Undersampled Cine Cardiac MRI Reconstruction and Analysis. (arXiv:2205.01673v1 [eess.IV])
    Cine cardiac magnetic resonance (CMR) imaging is considered the gold standard for cardiac function evaluation. However, cine CMR acquisition is inherently slow and in recent decades considerable effort has been put into accelerating scan times without compromising image quality or the accuracy of derived results. In this paper, we present a fully-automated, quality-controlled integrated framework for reconstruction, segmentation and downstream analysis of undersampled cine CMR data. The framework enables active acquisition of radial k-space data, in which acquisition can be stopped as soon as acquired data are sufficient to produce high quality reconstructions and segmentations. This results in reduced scan times and automated analysis, enabling robust and accurate estimation of functional biomarkers. To demonstrate the feasibility of the proposed approach, we perform realistic simulations of radial k-space acquisitions on a dataset of subjects from the UK Biobank and present results on in-vivo cine CMR k-space data collected from healthy subjects. The results demonstrate that our method can produce quality-controlled images in a mean scan time reduced from 12 to 4 seconds per slice, and that image quality is sufficient to allow clinically relevant parameters to be automatically estimated to within 5% mean absolute difference.
    CoCa: Contrastive Captioners are Image-Text Foundation Models. (arXiv:2205.01917v1 [cs.CV])
    Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
    Meta-Cognition. An Inverse-Inverse Reinforcement Learning Approach for Cognitive Radars. (arXiv:2205.01794v1 [eess.SP])
    This paper considers meta-cognitive radars in an adversarial setting. A cognitive radar optimally adapts its waveform (response) in response to maneuvers (probes) of a possibly adversarial moving target. A meta-cognitive radar is aware of the adversarial nature of the target and seeks to mitigate the adversarial target. How should the meta-cognitive radar choose its responses to sufficiently confuse the adversary trying to estimate the radar's utility function? This paper abstracts the radar's meta-cognition problem in terms of the spectra (eigenvalues) of the state and observation noise covariance matrices, and embeds the algebraic Riccati equation into an economics-based utility maximization setup. This adversarial target is an inverse reinforcement learner. By observing a noisy sequence of the radar's responses (waveforms), the adversarial target uses a statistical hypothesis test to detect if the radar is a utility maximizer. In turn, the meta-cognitive radar deliberately chooses sub-optimal responses that increase the Type-I error probability of the adversary's detector. We call this counter-adversarial step taken by the meta-cognitive radar inverse inverse reinforcement learning (I-IRL). We illustrate the meta-cognition results of this paper via simple numerical examples. Our approach to meta-cognition in this paper is based on revealed preference theory in micro-economics and inspired by results in differential privacy and adversarial obfuscation in machine learning.
    Second Order Path Variationals in Non-Stationary Online Learning. (arXiv:2205.01921v1 [cs.LG])
    We consider the problem of universal dynamic regret minimization under exp-concave and smooth losses. We show that appropriately designed Strongly Adaptive algorithms achieve a dynamic regret of $\tilde O(d^2 n^{1/5} C_n^{2/5} \vee d^2)$, where $n$ is the time horizon and $C_n$ a path variational based on second order differences of the comparator sequence. Such a path variational naturally encodes comparator sequences that are piecewise linear -- a powerful family that tracks a variety of non-stationarity patterns in practice (Kim et al, 2009). The aforementioned dynamic regret rate is shown to be optimal modulo dimension dependencies and poly-logarithmic factors of $n$. Our proof techniques rely on analysing the KKT conditions of the offline oracle and require several non-trivial generalizations of the ideas in Baby and Wang, 2021, where the latter work only leads to a slower dynamic regret rate of $\tilde O(d^{2.5}n^{1/3}C_n^{2/3} \vee d^{2.5})$ for the current problem.
    A Comprehensive Survey and Taxonomy on Image Dehazing Based on Deep Learning. (arXiv:2106.03323v2 [cs.CV] UPDATED)
    With the development of convolutional neural networks, hundreds of deep learning based dehazing methods have been proposed. In this paper, we provide a comprehensive survey on supervised, semi-supervised, and unsupervised dehazing. We first discuss the physical model, datasets, network modules, loss functions, and evaluation metrics that are commonly used. Then, the main contributions of various dehazing algorithms are categorized and summarized. Further, quantitative and qualitative experiments of various baseline methods are carried out. Finally, the unsolved issues and challenges that can inspire the future research are pointed out. A collection of useful dehazing materials is available at https://github.com/Xiaofeng-life/AwesomeDehazing.
    Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving. (arXiv:2105.00373v4 [cs.RO] UPDATED)
    The past few years have witnessed an increasing interest in improving the perception performance of LiDARs on autonomous vehicles. While most of the existing works focus on developing new deep learning algorithms or model architectures, we study the problem from the physical design perspective, i.e., how different placements of multiple LiDARs influence learning-based perception. To this end, we introduce an easy-to-compute information-theoretic surrogate metric to quickly and quantitatively evaluate LiDAR placement for 3D detection of different types of objects. We also present a new data collection, detection model training and evaluation framework in the realistic CARLA simulator to evaluate disparate multi-LiDAR configurations. Using several prevalent placements inspired by the designs of self-driving companies, we show the correlation between our surrogate metric and object detection performance of different representative algorithms on KITTI through extensive experiments, validating the effectiveness of our LiDAR placement evaluation approach. Our results show that sensor placement is non-negligible in 3D point cloud-based object detection, contributing up to a 10% performance discrepancy in terms of average precision in challenging 3D object detection settings. We believe that this is one of the first studies to quantitatively investigate the influence of LiDAR placement on perception performance. The code is available on https://github.com/HanjiangHu/Multi-LiDAR-Placement-for-3D-Detection.
    AmbiPun: Generating Humorous Puns with Ambiguous Context. (arXiv:2205.01825v1 [cs.CL])
    In this paper, we propose a simple yet effective way to generate pun sentences that does not require any training on existing puns. Our approach is inspired by humor theories which hold that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT3 to generate context words and then generate puns incorporating context words from both concepts. Human evaluation shows that our method successfully generates puns 52\% of the time, outperforming well-crafted baselines and the state-of-the-art models by a large margin.
  • Open

    Learning the temporal evolution of multivariate densities via normalizing flows. (arXiv:2107.13735v2 [stat.ML] UPDATED)
    In this work, we propose a method to learn multivariate probability distributions using sample path data from stochastic differential equations. Specifically, we consider temporally evolving probability distributions (e.g., those produced by integrating local or nonlocal Fokker-Planck equations). We analyze this evolution through machine learning assisted construction of a time-dependent mapping that takes a reference distribution (say, a Gaussian) to each and every instance of our evolving distribution. If the reference distribution is the initial condition of a Fokker-Planck equation, what we learn is the time-T map of the corresponding solution. Specifically, the learned map is a multivariate normalizing flow that deforms the support of the reference density to the support of each and every density snapshot in time. We demonstrate that this approach can approximate probability density function evolutions in time from observed sampled data for systems driven by both Brownian and L\'evy noise. We present examples with two- and three-dimensional, uni- and multimodal distributions to validate the method.
    Two Stage Curvature Identification with Machine Learning: Causal Inference with Possibly Invalid Instrumental Variables. (arXiv:2203.12808v2 [stat.ME] UPDATED)
    Instrumental variables regression is a popular causal inference method for endogenous treatment. A significant concern in practical applications is the validity and strength of instrumental variables. This paper aims to perform causal inference when all instruments are possibly invalid. To do this, we propose a novel methodology called two stage curvature identification (TSCI) together with a generalized concept to measure the strengths of possibly invalid instruments: such invalid instruments can still be used for inference in our framework. We fit the treatment model with a general machine learning method and propose a novel bias correction method to remove the overfitting bias from machine learning methods. Among a collection of spaces of violation functions, we choose the best one by evaluating invalid instrumental variables' strength. We demonstrate our proposed TSCI methodology in a large-scale simulation study and revisit the important economics question on the effect of education on earnings.
    Making SGD Parameter-Free. (arXiv:2205.02160v1 [math.OC])
    We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.
    Multiple Testing and Variable Selection along the path of the Least Angle Regression. (arXiv:1906.12072v5 [math.ST] UPDATED)
    We investigate multiple testing and variable selection using the Least Angle Regression (LARS) algorithm in high dimensions under the assumption of Gaussian noise. LARS is known to produce a piecewise affine solution path with change points referred to as the knots of the LARS path. The key to our results is an expression in closed form of the exact joint law of a $K$-tuple of knots conditional on the variables selected by LARS, namely the so-called post-selection joint law of the LARS knots. Numerical experiments demonstrate the perfect fit of our findings. This paper makes three main contributions. First, we build testing procedures on variables entering the model along the LARS path in the general design case when the noise level can be unknown. These testing procedures are referred to as the Generalized $t$-Spacing tests (GtSt) and we prove that they have an exact non-asymptotic level (i.e., the Type I error is exactly controlled). This extends the work of Taylor et al. (2014), where the spacing test applies to consecutive knots with known variance. Second, we introduce a new exact multiple false negatives test after model selection in the general design case when the noise level may be unknown. We prove that this testing procedure has exact non-asymptotic level for general design and unknown noise level. Third, we give an exact control of the false discovery rate under the orthogonal design assumption. Monte Carlo simulations and a real data experiment are provided to illustrate our results in this case. Of independent interest, we introduce an equivalent formulation of the LARS algorithm based on a recursive function.
    VICE: Variational Inference for Concept Embeddings. (arXiv:2205.00756v3 [cs.LG] UPDATED)
    In this paper, we introduce Variational Inference for Concept Embeddings (VICE), an approximate Bayesian method for learning object concept embeddings from human behavior in an odd-one-out triplet task. We use variational inference to obtain a sparse, non-negative solution with uncertainty estimates about each embedding value. We exploit these estimates to automatically select the dimensions that explain the data while yielding reproducible embeddings. We introduce a PAC learning bound for VICE that can be used to estimate generalization performance or determine a sufficient sample size for different experimental designs. VICE rivals or outperforms its predecessor, SPoSE, at predicting human behavior in a triplet task. VICE object representations are substantially more reproducible and consistent across different random initializations.
    Depth Uncertainty Networks for Active Learning. (arXiv:2112.06796v2 [cs.LG] UPDATED)
    In active learning, the size and complexity of the training dataset changes over time. Simple models that are well specified by the amount of data available at the start of active learning might suffer from bias as more points are actively sampled. Flexible models that might be well suited to the full dataset can suffer from overfitting towards the start of active learning. We tackle this problem using Depth Uncertainty Networks (DUNs), a BNN variant in which the depth of the network, and thus its complexity, is inferred. We find that DUNs outperform other BNN variants on several active learning tasks. Importantly, we show that on the tasks in which DUNs perform best they present notably less overfitting than baselines.
    Negative Sampling in Variational Autoencoders. (arXiv:1910.02760v3 [cs.LG] UPDATED)
    Modern deep artificial neural networks have achieved great success in the domain of computer vision and beyond. However, their application to many real-world tasks is undermined by certain limitations, such as overconfident uncertainty estimates on out-of-distribution data or performance deterioration under data distribution shifts. Several types of deep learning models used for density estimation through probabilistic generative modeling have been shown to fail to detect out-of-distribution samples by assigning higher likelihoods to anomalous data. We investigate this failure mode in Variational Autoencoder models, which are also prone to it, and improve the out-of-distribution generalization performance of the model by employing an alternative training scheme utilizing negative samples. We present a fully unsupervised version: when the model is trained in an adversarial manner, the generator's own outputs can be used as negative samples. We demonstrate empirically the effectiveness of the approach in reducing the overconfident likelihood estimates of out-of-distribution inputs on image data.
    Recurrent Flow Networks: A Recurrent Latent Variable Model for Density Modelling of Urban Mobility. (arXiv:2006.05256v2 [stat.ML] UPDATED)
    Mobility-on-demand (MoD) systems represent a rapidly developing mode of transportation wherein travel requests are dynamically handled by a coordinated fleet of vehicles. Crucially, the efficiency of an MoD system highly depends on how well supply and demand distributions are aligned in spatio-temporal space (i.e., to satisfy user demand, cars have to be available in the correct place and at the desired time). To do so, we argue that predictive models should aim to explicitly disentangle between temporal and spatial variability in the evolution of urban mobility demand. However, current approaches typically ignore this distinction by either treating both sources of variability jointly, or completely ignoring their presence in the first place. In this paper, we propose recurrent flow networks (RFN), where we explore the inclusion of (i) latent random variables in the hidden state of recurrent neural networks to model temporal variability, and (ii) normalizing flows to model the spatial distribution of mobility demand. We demonstrate how predictive models explicitly disentangling between spatial and temporal variability exhibit several desirable properties, and empirically show how this enables the generation of distributions matching potentially complex urban topologies.
    Minimax Estimation of Partially-Observed Vector AutoRegressions. (arXiv:2106.09327v2 [eess.SP] UPDATED)
    High-dimensional time series are a core ingredient of the statistical modeling toolkit, for which numerous estimation methods are known. But when observations are scarce or corrupted, the learning task becomes much harder. The question is: how much harder? In this paper, we study the properties of a partially-observed Vector AutoRegressive process, which is a state-space model endowed with a stochastic observation mechanism. Our goal is to estimate its sparse transition matrix, but we only have access to a small and noisy subsample of the state components. Interestingly, the sampling process itself is random and can exhibit temporal correlations, a feature shared by many realistic data acquisition scenarios. We start by describing an estimator based on the Yule-Walker equation and the Dantzig selector, and we give an upper bound on its non-asymptotic error. Then, we provide a matching minimax lower bound, thus proving near-optimality of our estimator. The convergence rate we obtain sheds light on the role of several key parameters such as the sampling ratio, the amount of noise and the number of non-zero coefficients in the transition matrix. These theoretical findings are commented and illustrated by numerical experiments on simulated data.
    Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting. (arXiv:2006.07507v3 [cs.LG] UPDATED)
    Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models. The new update is derived through the solution of an Ordinary Differential Equation (ODE) and solved in a closed form. We show empirically that this new parameter-free algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune.
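    For background, the classic Krichevsky-Trofimov coin-betting scheme that this line of work builds on can be written in a few lines; the sketch below assumes gradients bounded by 1 (enforced by clipping) and does not reproduce the paper's ODE-based update:

        import numpy as np

        def kt_coin_betting(grad, x0=0.0, eps=1.0, T=1000):
            # Krichevsky-Trofimov coin betting for 1-D online optimization:
            # no learning rate; the bet (iterate) adapts from past gradients.
            x, wealth, grad_sum = x0, eps, 0.0
            for t in range(1, T + 1):
                g = float(np.clip(grad(x), -1.0, 1.0))
                wealth -= g * (x - x0)          # settle the current bet
                grad_sum -= g                   # running sum of "coin outcomes"
                x = x0 + (grad_sum / (t + 1)) * wealth
            return x

        x_star = kt_coin_betting(lambda x: 2 * (x - 3))   # drifts toward 3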
    Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification. (arXiv:2102.07711v2 [cs.LG] UPDATED)
    We study bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker can observe both the selected actions and their corresponding rewards and can contaminate the rewards with additive noise. We show that any bandit algorithm with regret $O(\log T)$ can be forced to suffer a regret $\Omega(T)$ with an expected amount of contamination $O(\log T)$. This amount of contamination is also necessary, as we prove that there exists an $O(\log T)$ regret bandit algorithm, specifically the classical UCB, that requires $\Omega(\log T)$ amount of contamination to suffer regret $\Omega(T)$. To combat such attacks, our second main contribution is to propose verification based mechanisms, which use limited verification to access a limited number of uncontaminated rewards. In particular, for the case of unlimited verifications, we show that with $O(\log T)$ expected number of verifications, a simple modified version of the ETC type bandit algorithm can restore the order optimal $O(\log T)$ regret irrespective of the amount of contamination used by the attacker. We also provide a UCB-like verification scheme, called Secure-UCB, that also enjoys full recovery from any attacks, also with $O(\log T)$ expected number of verifications. To derive a matching lower bound on the number of verifications, we prove that for any order-optimal bandit algorithm, this number of verifications $\Omega(\log T)$ is necessary to recover the order-optimal regret. On the other hand, when the number of verifications is bounded above by a budget $B$, we propose a novel algorithm, Secure-BARBAR, which provably achieves $O(\min\{C,T/\sqrt{B} \})$ regret with high probability against weak attackers where $C$ is the total amount of contamination by the attacker, which breaks the known $\Omega(C)$ lower bound of the non-verified setting if $C$ is large.
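    For reference, the classical UCB algorithm named in the abstract, sketched without any attack or verification machinery:

        import numpy as np

        def ucb1(pull, n_arms, T):
            # Classical UCB1: play each arm once, then always play the arm
            # with the highest upper confidence bound. `pull(a)` returns a
            # reward in [0, 1].
            counts = np.zeros(n_arms)
            means = np.zeros(n_arms)
            for a in range(n_arms):
                counts[a], means[a] = 1, pull(a)
            for t in range(n_arms + 1, T + 1):
                ucb = means + np.sqrt(2 * np.log(t) / counts)
                a = int(np.argmax(ucb))
                r = pull(a)
                counts[a] += 1
                means[a] += (r - means[a]) / counts[a]   # incremental mean
            return means, counts

        # e.g. Bernoulli arms with hidden success probabilities:
        rng = np.random.default_rng(0)
        probs = [0.2, 0.5, 0.7]
        means, counts = ucb1(lambda a: float(rng.random() < probs[a]), 3, 10_000)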
    A Manifold Two-Sample Test Study: Integral Probability Metric with Neural Networks. (arXiv:2205.02043v1 [stat.ML])
    Two-sample tests aim to determine whether two collections of observations follow the same distribution. We propose two-sample tests based on integral probability metric (IPM) for high-dimensional samples supported on a low-dimensional manifold. We characterize the properties of the proposed tests with respect to the number of samples $n$ and the structure of the manifold with intrinsic dimension $d$. When an atlas is given, we propose a two-step test to identify the difference between general distributions, which achieves the type-II risk in the order of $n^{-1/\max\{d,2\}}$. When an atlas is not given, we propose the H\"older IPM test that applies to data distributions with $(s,\beta)$-H\"older densities, which achieves the type-II risk in the order of $n^{-(s+\beta)/d}$. To mitigate the heavy computation burden of evaluating the H\"older IPM, we approximate the H\"older function class using neural networks. Based on the approximation theory of neural networks, we show that the neural network IPM test has type-II risk in the order of $n^{-(s+\beta)/d}$, the same order as that of the H\"older IPM test. Our proposed tests are adaptive to low-dimensional geometric structure because their performance crucially depends on the intrinsic dimension instead of the data dimension.
    Estimation of Standard Auction Models. (arXiv:2205.02060v1 [cs.GT])
    We provide efficient estimation methods for first- and second-price auctions under independent (asymmetric) private values and partial observability. Given a finite set of observations, each comprising the identity of the winner and the price they paid in a sequence of identical auctions, we provide algorithms for non-parametrically estimating the bid distribution of each bidder, as well as their value distributions under equilibrium assumptions. We provide finite-sample estimation bounds which are uniform in that their error rates do not depend on the bid/value distributions being estimated. Our estimation guarantees advance a body of work in Econometrics wherein only identification results have been obtained, unless the setting is symmetric, parametric, or all bids are observable. Our guarantees also provide computationally and statistically effective alternatives to classical techniques from reliability theory. Finally, our results are immediately applicable to Dutch and English auctions.
    The Grammar of Interactive Explanatory Model Analysis. (arXiv:2005.00497v4 [cs.LG] UPDATED)
    The growing need for in-depth analysis of predictive models leads to a series of new methods for explaining their local and global properties. Which of these methods is the best? It turns out that this is an ill-posed question. One cannot sufficiently explain a black-box machine learning model using a single method that gives only one perspective. Isolated explanations are prone to misunderstanding, leading to wrong or simplistic reasoning. This problem is known as the Rashomon effect and refers to diverse, even contradictory, interpretations of the same phenomenon. Surprisingly, most methods developed for explainable and responsible machine learning focus on a single aspect of the model's behavior. In contrast, we showcase the problem of explainability as an interactive and sequential analysis of a model. This paper proposes how different Explanatory Model Analysis (EMA) methods complement each other and discusses why it is essential to juxtapose them. The introduced process of Interactive EMA (IEMA) derives from the algorithmic side of explainable machine learning and aims to embrace ideas developed in cognitive sciences. We formalize the grammar of IEMA to describe potential human-model dialogues. It is implemented in a widely used human-centered open-source software framework that adopts interactivity, customizability and automation as its main traits. We conduct a user study to evaluate the usefulness of IEMA, which indicates that an interactive sequential analysis of a model increases the performance and confidence of human decision making.
    Accelerating Inhibitor Discovery for Multiple SARS-CoV-2 Targets with a Single, Sequence-Guided Deep Generative Framework. (arXiv:2204.09042v2 [q-bio.QM] UPDATED)
    The COVID-19 pandemic has highlighted the urgency for developing more efficient molecular discovery pathways. As exhaustive exploration of the vast chemical space is infeasible, discovering novel inhibitor molecules for emerging drug-target proteins is challenging, particularly for targets with unknown structure or ligands. We demonstrate the broad utility of a single deep generative framework toward discovering novel drug-like inhibitor molecules against two distinct SARS-CoV-2 targets -- the main protease (Mpro) and the receptor binding domain (RBD) of the spike protein. To perform target-aware design, the framework employs a target sequence-conditioned sampling of novel molecules from a generative model. Micromolar-level in vitro inhibition was observed for two candidates (out of four synthesized) for each target. The most potent spike RBD inhibitor also emerged as a rare non-covalent antiviral with broad-spectrum activity against several SARS-CoV-2 variants in live virus neutralization assays. These results show that a broadly deployable machine intelligence framework can accelerate hit discovery across different emerging drug-targets.
    Machine Learning based Framework for Robust Price-Sensitivity Estimation with Application to Airline Pricing. (arXiv:2205.01875v1 [stat.ML])
    We consider the problem of dynamic pricing of a product in the presence of feature-dependent price sensitivity. Based on the Poisson semi-parametric approach, we construct a flexible yet interpretable demand model where the price-related part is parametric while the remaining (nuisance) part of the model is non-parametric and can be modeled via sophisticated ML techniques. The estimation of the price-sensitivity parameters of this model via direct one-stage regression techniques may lead to biased estimates. We propose a two-stage estimation methodology which makes the estimation of the price-sensitivity parameters robust to biases in the nuisance parameters of the model. In the first stage, we construct estimators of observed purchases and price given the feature vector using sophisticated ML estimators such as deep neural networks. Utilizing the estimators from the first stage, in the second stage we leverage a Bayesian dynamic generalized linear model to estimate the price-sensitivity parameters. We test the performance of the proposed estimation schemes on simulated and real sales transaction data from the airline industry. Our numerical studies demonstrate that the two-stage approach provides more accurate estimates of the price-sensitivity parameters than the direct one-stage approach.
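    The two-stage idea can be illustrated with a generic residual-on-residual (Robinson-style) sketch. Note the hedges: the paper's second stage is a Bayesian dynamic GLM on Poisson demand, which is replaced here by ordinary least squares, and cross-fitting is omitted for brevity:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.linear_model import LinearRegression

        def two_stage_price_sensitivity(X, price, sales):
            # Stage 1: flexible ML models absorb the feature-driven (nuisance)
            # parts of sales and price.
            sales_hat = GradientBoostingRegressor().fit(X, sales).predict(X)
            price_hat = GradientBoostingRegressor().fit(X, price).predict(X)
            # Stage 2: regress residual on residual, so errors in the nuisance
            # fits enter the sensitivity estimate only at second order.
            stage2 = LinearRegression(fit_intercept=False).fit(
                (price - price_hat).reshape(-1, 1), sales - sales_hat)
            return stage2.coef_[0]              # estimated price sensitivity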
    Local versions of sum-of-norms clustering. (arXiv:2109.09589v2 [cs.LG] UPDATED)
    Sum-of-norms clustering is a convex optimization problem whose solution can be used for the clustering of multivariate data. We propose and study a localized version of this method, and show in particular that it can separate arbitrarily close balls in the stochastic ball model. More precisely, we prove a quantitative bound on the error incurred in the clustering of disjoint connected sets. Our bound is expressed in terms of the number of datapoints and the localization length of the functional.
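    The (global) sum-of-norms objective is a small convex program; below is a sketch using cvxpy. The paper's localized variant restricts the pairwise fusion penalty to nearby points, which is not done here:

        import cvxpy as cp
        import numpy as np

        def sum_of_norms_clustering(X, lam=0.5):
            # min_U 0.5 * ||U - X||_F^2 + lam * sum_{i<j} ||u_i - u_j||_2.
            # Points whose centroids u_i coincide at the optimum share a cluster.
            n, d = X.shape
            U = cp.Variable((n, d))
            fusion = sum(cp.norm(U[i] - U[j], 2)
                         for i in range(n) for j in range(i + 1, n))
            cp.Problem(cp.Minimize(0.5 * cp.sum_squares(U - X) + lam * fusion)).solve()
            return U.value

        X = np.vstack([np.random.randn(5, 2) - 3, np.random.randn(5, 2) + 3])
        centroids = sum_of_norms_clustering(X)   # rows collapse into two groups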
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v1 [cs.LG])
    We propose predictive sampling as an approach to selecting actions that balance between exploration and exploitation in nonstationary bandit environments. When specialized to stationary environments, predictive sampling is equivalent to Thompson sampling. However, predictive sampling is effective across a range of nonstationary environments in which Thompson sampling suffers. We establish a general information-theoretic bound on the Bayesian regret of predictive sampling. We then specialize this bound to study a modulated Bernoulli bandit environment. Our analysis highlights a key advantage of predictive sampling over Thompson sampling: predictive sampling deprioritizes investments in exploration where acquired information will quickly become less relevant.
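    As a point of comparison, the stationary baseline mentioned in the abstract -- Bernoulli Thompson sampling with Beta posteriors -- is a few lines; predictive sampling itself is not reproduced here:

        import numpy as np

        def thompson_bernoulli(pull, n_arms, T):
            # Thompson sampling with Beta(1, 1) priors: each round, sample a
            # mean for every arm from its posterior and play the argmax.
            rng = np.random.default_rng(0)
            wins, losses = np.ones(n_arms), np.ones(n_arms)
            for _ in range(T):
                a = int(np.argmax(rng.beta(wins, losses)))
                r = pull(a)                    # reward in {0, 1}
                wins[a] += r
                losses[a] += 1 - r
            return wins / (wins + losses)      # posterior mean estimates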
    Sparse Representations of Positive Functions via First and Second-Order Pseudo-Mirror Descent. (arXiv:2011.07142v4 [stat.ML] UPDATED)
    We consider expected risk minimization problems when the range of the estimator is required to be nonnegative, motivated by the settings of maximum likelihood estimation (MLE) and trajectory optimization. To facilitate nonlinear interpolation, we hypothesize that the search space is a Reproducing Kernel Hilbert Space (RKHS). We develop first and second-order variants of stochastic mirror descent employing (i) \emph{pseudo-gradients} and (ii) complexity-reducing projections. Compressive projection in the first-order scheme is executed via kernel orthogonal matching pursuit (KOMP), which overcomes the fact that the vanilla RKHS parameterization grows unbounded with the iteration index in the stochastic setting. Moreover, pseudo-gradients are needed when gradient estimates for the cost are only computable up to some numerical error, which arises in, e.g., integral approximations. Under constant step-size and compression budget, we establish tradeoffs between the radius of convergence of the expected sub-optimality and the projection budget parameter, as well as non-asymptotic bounds on the model complexity. To refine the solution's precision, we develop a second-order extension which employs recursively averaged pseudo-gradient outer-products to approximate the Hessian inverse, whose convergence in mean is established under an additional eigenvalue decay condition on the Hessian of the optimal RKHS element, which is unique to this work. Experiments demonstrate favorable performance on inhomogeneous Poisson Process intensity estimation in practice.
    B\'ezier Curve Gaussian Processes. (arXiv:2205.01754v1 [stat.ML])
    Probabilistic models for sequential data are the basis for a variety of applications concerned with processing timely ordered information. The predominant approach in this domain is given by neural networks, which incorporate either stochastic units or components. This paper proposes a new probabilistic sequence model building on probabilistic B\'ezier curves. Using Gaussian distributed control points, these parametric curves pose a special case of Gaussian processes (GP). Combined with a Mixture Density Network, Bayesian conditional inference can be performed without the need for mean field variational approximation or Monte Carlo simulation, which is a requirement of common approaches. For assessing this hybrid model's viability, it is applied to an exemplary sequence prediction task. In this case the model is used for pedestrian trajectory prediction, where a generated prediction also serves as a GP prior. Following this, the initial prediction can be refined using the GP framework by calculating different posterior distributions, in order to adapt more towards a given observed trajectory segment.

  • Open

    [D] What do you think about PolyLoss?
    Paper: PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions. One reviewer believes that "this paper makes some interesting and thorough findings by approximating cross entropy loss using Taylor expansion." Another reviewer mentioned: "In my comparisons it performed worse than LabelSmoothingCrossEntropy." Link: https://doublind.com/paper/2204.12511-PolyLoss:-A-Polynomial-Expansion-Perspective-of-Classification-Loss-Functions Have you read this paper? What do you think? submitted by /u/DouBlindDotCOM
    [R] crop video when a character appears.
    I don't know if this has a name. I would like to know if there is an AI to cut out a character that appears in a video; that is, to take an input video where many characters appear and produce an output video with all the frames where the character that interests me appears. submitted by /u/macob12432
    [D] Proving Convergence of Hybrid-like Training of DNNs
    Hi all! I am working on a submission for a specialized prediction task where the inputs go from space A->B->C. I have model F, which predicts A->B, and model G, which predicts B->C. I have trained these models in 3 different ways: (1) train F first and then G -- F has a minimum and G has a minimum; (2) train F and G jointly -- the combined model has a global minimum; (3) train F with G frozen for the first half of the epochs, then train the complete model for the rest of the epochs. For approach 3, I am struggling to identify what the minimum of the model would be. In other words, is it possible to prove that approach 3 converges? PS: I have GT labels for the B and C spaces. Loss is calculated as a*CE(B)+b*CE(C). submitted by /u/dwight_funke
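    For readers following along, a sketch of schedule 3 in PyTorch -- F, G, the loader, and the loss function are placeholders for the poster's models and data:

        import torch

        def staged_training(F, G, loader, loss_fn, epochs, a=1.0, b=1.0):
            opt = torch.optim.Adam(list(F.parameters()) + list(G.parameters()))
            for epoch in range(epochs):
                # First half: G gets no parameter updates, though the loss on
                # space C still backpropagates through G into F.
                freeze_G = epoch < epochs // 2
                for p in G.parameters():
                    p.requires_grad = not freeze_G
                for x_a, y_b, y_c in loader:   # labels in spaces B and C
                    b_pred = F(x_a)
                    c_pred = G(b_pred)
                    loss = a * loss_fn(b_pred, y_b) + b * loss_fn(c_pred, y_c)
                    opt.zero_grad(); loss.backward(); opt.step()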
    [R] Question about multi-hop question answering using knowledge graphs
    Is anyone familiar with any papers/datasets that go beyond 3 hops? Every paper/dataset I look at uses 1, 2, or at best 3 hops. I'm particularly interested in 10+ hops. submitted by /u/DaBobcat
    Fast API vs Flask? [Discussion]
    Hey all, I host a podcast and recently interviewed Sebastián Ramírez, the creator of FastAPI. Aside from the cool convo, I have been noticing lots of discussion about FastAPI potentially replacing Flask. I also saw lots of FastAPI love in this thread in the MLOps Community, where I asked which one people generally use these days. I'm interested in getting more data points and kicking off a discussion to hear how others look at this one: Is Flask still your go-to? Do you use both? Which one are you opinionated about, and why? submitted by /u/dpbrinkm
    [D]: HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    I am searching for a book that was written to help engineers and new hires get up to speed with the foundational mathematics supporting ML research at Google. It was not so much focused on ML, but was more a handbook on pure mathematics, such as proofs (at least that's what I got from the first chapter and what I remember from my waning memory). It was written in a very intuitive way and I can't for the life of me find it. Does this ring a bell for anyone? submitted by /u/nutin2chere
    [N] New Google DeepMind Flamingo Tackles Multiple Tasks With Single Visual Language Model
    DeepMind recently introduced Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples (in a "few shots"), without any additional training required. Flamingo's simple interface makes this possible, taking as input a prompt consisting of interleaved images, videos, and text and then outputting associated language. Similar to the behaviour of large language models (LLMs), which can address a language task by processing examples of the task in their text prompt, Flamingo's visual and text interface can steer the model towards solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in Flamingo's prompt, the model can be asked a question with a new image or video, and then generate an answer. Full website and whitepaper here. Video here. submitted by /u/SlightSituation
    [P] Making Clubhouse's hallway more relevant with machine learning
    Hi there! Clubhouse shared a case study on how it ranked recommendations for real-time, short-lived rooms in its Hallway with GBDTs and fast features. Thought you would find it interesting! Post here: https://blog.clubhouse.com/making-the-hallway-more-relevant-with-machine-learning/ Live Q&A: https://www.clubhouse.com/event/MOp41drX?utm_medium=ch_event&utm_campaign=sI95qy9i-EC5I3MvlueR7g-176282 submitted by /u/speedbreeze
    [D] Accuracy Convergent Field Predictors
    Hello, Here is a paper on supervised machine learning topics: https://www.researchgate.net/publication/360264212_Accuracy_Convergent_Field_Predictors It discusses accuracy convergence for predictive algorithms. Techniques are presented for making algorithms achieve accuracy convergence. It also provides an analogy with the sampling of fields in physics. I am interested in your opinions. Thanks. submitted by /u/crispub
    Why TensorFlow's API is more complete than PyTorch's for everyday deep learning [D]
    What are some of the gaps in PyTorch (torch) usability compared to TensorFlow/Keras (TF)? Sure, each one of these challenges can be overcome manually by an advanced user, but together they add up to a more tedious experience. Disclaimer: please take this with a grain of salt and feel free to jump in with corrections. I am primarily a TF user who has only dabbled with torch, and I welcome your help in drawing a more accurate comparison. I'm the creator of github.com/aiqc/aiqc, which abstracts/supports both tf and torch because they have the same components. --- 1. No NumPy input/output for models. TF allows you to feed your model an ndarray as input directly and returns ndarray output. This is great because sklearn preprocessing and metrics use ndarrays. It makes for…
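    To make the first gap concrete, here is the boundary in both frameworks with toy shapes and models; Keras consumes and returns ndarrays directly, while PyTorch requires explicit tensor conversion:

        import numpy as np
        import tensorflow as tf
        import torch

        X = np.random.rand(256, 4).astype("float32")
        y = np.random.rand(256, 1).astype("float32")

        # Keras: ndarrays in, ndarrays out.
        tf_model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        tf_model.compile(optimizer="adam", loss="mse")
        tf_model.fit(X, y, epochs=1, verbose=0)
        tf_preds = tf_model.predict(X)                       # ndarray

        # PyTorch: convert at the boundary, then convert back.
        torch_model = torch.nn.Linear(4, 1)
        torch_preds = torch_model(torch.from_numpy(X)).detach().numpy()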
    [R] Using few-shot learning language models as weak supervision
    Large language models embed a lot of useful knowledge in their pre-trained weights, but they are typically insufficient solutions on their own, either due to knowledge gaps or inability to transfer what they know. But there's another way. → https://snorkel.ai/few-shot-learning-large-language-models/ At Snorkel, we have seen practitioners get more utility from language models by applying these zero-shot or few-shot learners as labeling functions in a weak supervision framework. This tends to outperform both a language model and non-language-model LFs used in isolation. submitted by /u/robiriondo
    [D] How to detect Covariate shift of NLP models?
    I have an NLP model, for example, Sentiment Analysis. This model serves in production. I want to detect Data Drift, and specifically Covariate Shift, for this model. I saw that Cosine Similarity may solve this issue, but I'm concerned about: The ability to calculate it - cosine similarity can only be calculated for vectors that live in the same vector space. Do all of the embeddings that a model produces live in the same space? The time complexity - if I have 1M training data points and 1M prediction data points and I want to infer the average cosine difference, I'll have to compute the cosine difference of every prediction against every training data point. I can sample, but what is the sampling algorithm? I'd love to hear your opinion about: Different solutions to calculate covariate shift for NLP models My concerns about the cosine similarity submitted by /u/igaloly [link] [comments]  ( 1 min )
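    On the first concern: embeddings are only comparable when they come from the same fixed encoder (same model, same layer); vectors from different models or layers do not share a space. On the second, random subsampling is the usual answer. A hedged sketch (train_emb and prod_emb are hypothetical arrays of sentence embeddings from one encoder):

        import numpy as np
        from sklearn.metrics.pairwise import cosine_similarity

        rng = np.random.default_rng(0)

        def drift_score(train_emb, prod_emb, n_sample=5000):
            # subsample both sides: O(n_sample^2) instead of O(1M x 1M)
            tr = train_emb[rng.choice(len(train_emb), n_sample, replace=False)]
            pr = prod_emb[rng.choice(len(prod_emb), n_sample, replace=False)]
            cross = cosine_similarity(pr, tr).mean()    # production vs training
            within = cosine_similarity(tr, tr).mean()   # training vs itself (baseline)
            # a widening gap between the two suggests covariate shift
            return within - cross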
    Architecture suggestion for multi-class classification task on images with captions [D]
    I am currently starting on a multi-class classification task on images that come with a small description. We are given training data consisting of images, their captions, and labels, and we need to construct an architecture to classify the label of a new image with a caption. I have done some research online and was told that the best architecture for a multi-class classification task on images would be a combination of CNN and RNN, but I couldn't find guidance on how to use the captions together with the images for training. Any advice on where I should start? submitted by /u/anzhuoxianshen [link] [comments]  ( 1 min )
    [R][P][D] Available approaches to identify specific sentences in a text
    Hey guys, I am working on an NLP pipeline research project and want to reconsider one of its tasks. Currently, a heuristic is utilized to extract the target sentence that is most similar (centroid) to the overall text (the set of sentences which contain the target sentence). The target sentence would be the root of a graph I want to construct from the text. I have training data available with annotated sentences. Is there any recent approach I could implement to solve this problem? I thought about reformulating it as a question-answering approach using a Transformer, but I would like to retrieve a more or less identical version of the target sentence. Looking forward to your answers! submitted by /u/wastingmytime69 [link] [comments]  ( 1 min )
    [P] Yaetos-A framework to simplify the creation of data pipelines
    Hi everyone, I would like to share a blog post explaining Yaetos, an open source data framework I created some time ago and used in previous companies: https://medium.com/@arthurprevot/yaetos-data-framework-description-ddc71caf6ce . It is meant for data scientists, engineers, and analysts to create and schedule data pipelines and ML models in the AWS cloud. Any feedback on the tool or the article is welcome! Thanks! submitted by /u/arthurprevot [link] [comments]  ( 1 min )
    [P] Anomaly detection with similarity learning approach.
    Hi everyone! Anomaly detection is one of the exciting problems where metric learning can demonstrate an advantage over classical approaches. This case study illustrates how to do this with a practical example of quality control for coffee beans. How to train a detector of spoiled coffee beans with just a couple hundred labeled examples. https://qdrant.tech/articles/detecting-coffee-anomalies/ submitted by /u/devzaya [link] [comments]  ( 1 min )
    [N] D4 Data presents Podcast #15 "Federated Learning with Flower"
    Flower goes international. Traction is increasing for federated learning, as well as for our open-source federated learning framework Flower (https://flower.dev/). In federated learning, we do not collect data to train AI models; we train AI models in the data silos, collect only the models, and aggregate them to create a global AI model. The global AI model has the knowledge of all data silos but has never seen their data. Therefore, federated learning connects data silos in a privacy-preserving manner. Many people already understand this functionality, but some questions remain unanswered, such as: What is the difference between edge computing and federated learning? What are the use cases of federated learning? Can federated learning reduce the carbon footprint? If you want to know the answers, check out this podcast recorded by the D4 Data Podcast. In addition, the history of federated learning and the differences between centralized and federated learning are presented, so that newcomers to federated learning can also easily understand the technology. https://www.youtube.com/watch?v=EFupbmLfkwQ submitted by /u/burnai [link] [comments]  ( 1 min )
    [D] [P] Trying to guess internal rules of an insurance company scoring mechanism
    Hi everyone! I'm relatively new to ML so maybe I'm reinventing the wheel here, but can you please hear me out and maybe give some advice. Let's say I'm an insurance broker providing data about my clients to an insurance company to get a quote. The insurance company can either give me one or refuse to quote. The characteristics of the clients are encoded in a number of numerical values: age, car horsepower, a coefficient depending on previous loss record, a coefficient depending on territory, and so on. So all numerical values. The decision of the insurance company is based on some internal rules it has. For example: we don't insure drivers with a loss record coefficient bigger than N, or some other rules. Unfortunately the company doesn't provide me with these rules. So I'd like to guess them to understand my target audience better and focus my marketing efforts only on those potential insureds that will for sure be provided with a quote by the insurance company. To achieve this I'm planning to do the following: build a model that will predict the outcome of addressing the insurance company (1 - they agree to quote, 0 - they refuse) based on the historic data of quotes and refusals I have on file. Then I will take an "average successful" quote and will start to change its parameters one by one to see when my model returns 0 (insurance company refused to quote). By doing so I will try to guess the boundaries of the coefficients in my data - meaning the internal rules of the insurance company. What do you think of this? How viable is this approach? submitted by /u/alex-and-r [link] [comments]  ( 2 min )
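    A sketch of one standard alternative: fit a shallow surrogate decision tree, which gives human-readable accept/refuse rules directly instead of probing one feature at a time (the file and column names below are hypothetical):

        import pandas as pd
        from sklearn.tree import DecisionTreeClassifier, export_text

        df = pd.read_csv("quotes.csv")             # hypothetical historic quote log
        features = ["age", "horsepower", "loss_coef", "territory_coef"]
        X, y = df[features], df["quoted"]          # y: 1 = quoted, 0 = refused

        # a shallow tree keeps the recovered "rules" readable
        tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
        print(export_text(tree, feature_names=features))

    If the company's rules really are axis-aligned thresholds, a tree of this form can recover them almost verbatim, and unlike the probe-one-feature-at-a-time plan it can surface rules that involve interactions between features.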
    [N] Jupyter Notebook Competition - Data science
    Hi everyone! I'd like to share news of a competition you might be interested in. It's for those interested and passionate about #coding and #datascience / #bigdata. It's the Jupyter Notebook Competition, which runs until 31 July 2022. Participants can choose between 4 tracks and develop/improve a notebook. You get to showcase your skills, uncover new insights on Copernicus data usage, and potentially win a cash prize from a total pool of €5,000! For more info: https://notebook.wekeo.eu/ The competition is funded by the #Copernicus programme and developed as a joint project with EUMETSAT, the European Centre for Medium-Range Weather Forecasts, Mercator Ocean International, and the European Environment Agency. submitted by /u/NbCompetition_WEkEO [link] [comments]  ( 1 min )
    [D]: What does Replika exactly consist of? (besides GPT-neo)
    I just wondered, besides the visual stuff, how does this chatbot work? There are unmaintained libraries like cakechat on Replika's GitHub, but if someone could break Replika down into its core components, that would be great. I imagine something like this: (1) a loop to accept the user's input; (2) something to transform the input into something better suited for GPT-Neo (what?); (3) the GPT-Neo generator; (4) something to transform the output into something suitable for the user (what?). How do those two extra layers most likely work? Any ideas on how to dig deeper into it? submitted by /u/GerritTheBerrit [link] [comments]  ( 1 min )
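    Replika's actual stack is not public, but as a hypothetical sketch of such a pipeline around an open GPT-Neo checkpoint, the two "extra layers" are usually prompt construction (input transform) and response filtering (output transform):

        import re
        from transformers import pipeline

        # hypothetical sketch, not Replika's real components
        generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

        PERSONA = "The following is a friendly chat between a user and an AI companion.\n"
        history = []

        while True:
            user = input("you> ")
            # input transform: fold persona + recent turns into one prompt
            prompt = PERSONA + "\n".join(history[-6:]) + f"\nUser: {user}\nAI:"
            raw = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
            # output transform: keep only the first AI line after the prompt
            reply = re.split(r"\n(?:User|AI):", raw[len(prompt):])[0].strip()
            print("ai>", reply)
            history += [f"User: {user}", f"AI: {reply}"]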
    [P] Kaggle competition : Be honest, What do you think of this image? (Unsplash)
    Hello Reddit, I am writing this post to make some personal promotion of a Kaggle competition that I built to test the DVC framework. The goal of the competition is to predict the interest (a combination of clicks, views, and age) that an image can attract. Competition: https://www.kaggle.com/competitions/be-honest-what-do-you-think-of-this-image Have fun (any feedback is welcome) submitted by /u/jeanmidev [link] [comments]  ( 1 min )
  • Open

    GraphWorld: Advances in Graph Benchmarking
    John Palowitch and Anton Tsitsulin, Research Scientists, Google Research, Graph Mining team Graphs are very common representations of natural systems that have connected relational components, such as social networks, traffic infrastructure, molecules, and the internet. Graph neural networks (GNNs) are powerful machine learning (ML) models for graphs that leverage their inherent connections to incorporate context into predictions about items within the graph or the graph as a whole. GNNs have been effectively used to discover new drugs, help mathematicians prove theorems, detect misinformation, and improve the accuracy of arrival time predictions in Google Maps. A surge of interest in GNNs during the last decade has produced thousands of GNN variants, with hundreds introduced each year. …  ( 8 min )
  • Open

    Wow ! So that's how a Data Science field is carried!!
    submitted by /u/networkninja10 [link] [comments]
    Another Firing Among Google’s A.I. Brain Trust, and More Discord
    submitted by /u/BB4evaTB12 [link] [comments]
    AI Dream 42 - Dramatic End of the World ByeBye City
    submitted by /u/LordPewPew777 [link] [comments]
    AI News | Breakthrough AI Robot Arm Pick And Place System | New Google Deepmind Flamingo Visual Language Model
    submitted by /u/getrich_or_diemining [link] [comments]
    Ripples in Time (made with starryai)
    submitted by /u/Losthel [link] [comments]
    DALL-E 2 is amazing, but what's even cooler is how it actually *understands* text and produces images. (Article version linked in description)
    submitted by /u/OnlyProggingForFun [link] [comments]
    Generate images from text with Latent Diffusion LAION-400M
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 3 min )
    Deepmind Introduces Flamingo: An Open-Ended Single Visual Language Model (VLM) For Multimodal Machine Learning Research
    Intelligence measures how quickly a person can adjust to a new situation using only a few simple instructions. Despite the contrast between the two media, children can recognize real animals at the zoo after seeing a few photographs of them in a book. Typical visual models, on the other hand, do not yet reflect this level of human intellect: they need to be trained on tens of thousands of examples that have been explicitly annotated for the task. If the goal is to count and identify animals in an image, such as “three zebras,” thousands of photographs must be collected and each image annotated with the animals' numbers and species. The need to train a new model each time it is confronted with a new task is the main drawback, making the process inefficient, costly, and resource…  ( 1 min )
    Better text search
    Can somebody recommend a better way for me to search through a document of 50,000 document descriptions for concepts that may not appear verbatim in a description but are implied contextually? Example: searching for "fintech" should return a company description that deals with finances but doesn't necessarily contain the word "fintech". submitted by /u/CanadianBacon2021 [link] [comments]  ( 1 min )
    Has anyone heard of or participated in the Persolv AI boot camp?
    It offers some really cool opportunities but it’s extremely expensive. Was wondering if anyone has had any experience with it that they would be willing to share. Seems really cool, but I don’t want to get scammed out of thousands. submitted by /u/eatingbaboons [link] [comments]  ( 1 min )
  • Open

    Deploy and manage machine learning pipelines with Terraform using Amazon SageMaker
    AWS customers are relying on Infrastructure as Code (IaC) to design, develop, and manage their cloud infrastructure. IaC ensures that customer infrastructure and services are consistent, scalable, and reproducible, while being able to follow best practices in the area of development operations (DevOps). One possible approach to manage AWS infrastructure and services with IaC is […]  ( 6 min )
  • Open

    Computer Vision development services: a brief introduction
    Talking about the main Computer Vision development services and identifying key activities, goals, and outcomes to simplify…  ( 5 min )
  • Open

    Setting AIs on SIGGRAPH: Top Academic Researchers Collaborate With NVIDIA to Tackle Graphics’ Greatest Challenges
    NVIDIA’s latest academic collaborations in graphics research have produced a reinforcement learning model that smoothly simulates athletic moves, ultra-thin holographic glasses for virtual reality, and a real-time rendering technique for objects illuminated by hidden light sources. These projects — and over a dozen more — will be on display at SIGGRAPH 2022, taking place Aug. …  ( 5 min )
  • Open

    `ReinforcementLearning.jl` presentation in the Julia User Group Munich
    https://youtu.be/ckIIxKM14Ow submitted by /u/kir0ul [link] [comments]
    Train your first Deep Reinforcement Learning agent to land correctly on the moon 🌕 (Deep Reinforcement Learning Free Class by Hugging Face 🤗)
    Hey there! We're happy to announce that we just published the first Unit of the Deep Reinforcement Learning Class 🥳 In this Unit, you'll learn the foundations of Deep RL. And you’ll train your first lander agent🚀 to land correctly on the moon 🌕 using Stable-Baselines3 and share it with the community. You’ll be able to compare the results of your LunarLander-v2 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/ThomasSimonini/Lunar-Lander-Leaderboard 1️⃣ The introduction to deep RL article 👉 https://huggingface.co/blog/deep-rl-intro 2️⃣ The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb 3️⃣ The leaderboard 👉 https://huggingface.co/spaces/ThomasSimonini/Lunar-Lander-Leaderboard If you have questions and feedback, I would love to answer. [screenshot omitted] submitted by /u/cranthir_ [link] [comments]  ( 1 min )
    How to handle problems where the size of the state is dependent on actions and available actions depends on the state?
    Hello guys, quite new to this group and RL. The problem I want to deal with is a more complicated version of the allocation/assignment problem. Here the state space is not only large, but the size of the state also changes for different situations. And the action space is the same as the state space. Are there any examples/algorithms I should look into for this? submitted by /u/Blue_Dude3 [link] [comments]  ( 2 min )
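    One common trick for state-dependent action sets (a hedged sketch, not specific to any one RL library) is action masking: compute logits for the full action set and push invalid actions to minus infinity before sampling:

        import torch

        def masked_policy(logits, valid_mask):
            # valid_mask: bool tensor, True where the action is legal in this state;
            # -inf logits give invalid actions exactly zero probability
            masked = logits.masked_fill(~valid_mask, float("-inf"))
            return torch.distributions.Categorical(logits=masked)

        # toy usage: 5 actions, only actions 0 and 3 are legal right now
        dist = masked_policy(torch.randn(5),
                             torch.tensor([1, 0, 0, 1, 0], dtype=torch.bool))
        action = dist.sample()  # always 0 or 3

    This keeps a fixed maximum action space while the legal subset changes per state; for states of varying size, padding plus masking (or set/attention-based encoders) is the usual companion trick.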
    Performance of policy (reward) massively deteriorates after a certain amount of iterations
    Hi all, as you can see in the "rewards" plot [screenshot omitted], the reward looks really good after a few iterations, but then deteriorates and collapses entirely from 50k iterations onward. Is there any method to prevent the reward from swinging so much and make it increase more steadily? (Decreasing the learning rate didn't help...) What does the low reward from 50k iterations imply? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
  • Open

    Improve Performance and Data Availability with Elastic Block Store (EBS)
    Author: Bokang Zhang, Chenhao Huang. Nowadays, many Database-as-a-Service (DBaaS) solutions separate the computation layer and the storage layer. These include, for example, Amazon Aurora and Google BigQuery. This solution is attractive, as data storage and data replication can be handled by existing services. DBaaS removes the need to worry about this…  ( 6 min )
    Do regulatory data projects really need design-time data lineage? Probably not.
    Your regulatory data project likely has no use case for design-time data lineage. tl;dr: mapping data lineage at design time, for its own end, has no regulatory use case or ROI. Buying a specialist tool to support that mapping has even less ROI. Regulations see that kind of documentary data lineage as ancillary at best.…  ( 10 min )
    Seven data-centric architecture innovations that businesses can’t afford to overlook
    My mentors and compadres at Semantic Arts hold a Data-Centric Architecture Forum (DCAF) every year in Fort Collins, Colorado. It’s a chance for technically inclined data and data modeling enthusiasts to brainstorm on the gaps in the architecture and how to fill those gaps. This year’s DCAF takes place on June 6th – 8th, 2022.…  ( 4 min )
    Today’s Data Purgatory and a History Lesson from the Energy Industry
    “Now I shall sing the second kingdom there where the soul of man is cleansed, made worthy to ascend to Heaven.” (Dante Alighieri, The Divine Comedy, Purg. I.4–9) In its most recent (April 2022) Data and Analytics Trends report, Gartner made a number of good points in and around the subject of AI and how…  ( 4 min )
    Handling SQL Server Database Corruption When Your File System Fails
    We have designed a lab environment to do some tests on some devices. One of the devices that we tested was a WD My Book World Edition 1TB (White Light). When we plugged in the power cord, it sounded like it started up fine; however, the web interface was not accessible and there was…  ( 4 min )
    The Advent of Killer Robots
    Next-generation warfare involves human-machine teaming (HMT). Human-AI interfaces are more error prone than traditional combat. Autonomous robots as a solution pose ethical challenges. At the core of the military’s use of HMT is artificial intelligence used to program flights and target drone strikes. But while drones remove combat pilots from dangerous situations, the…  ( 3 min )
    The executable digital twin
    Recently I read that Siemens has defined a new term called the ‘executable digital twin’. If you have worked in the space of digital twins before, you know that there is no shortage of definitions for digital twins! At first I thought, why muddy the waters with yet one more definition? But I think there is…  ( 3 min )
    Some Helpful Tips to Choose the Best Domain Registrar
    When you’re planning to purchase a domain, you need to go through a domain registrar. Apart from the overwhelming process of choosing a domain name, you also need to be careful while choosing the domain registrar. Domain registrars are reputed and professional companies that sell different types of domain names for websites and manage…  ( 3 min )
    Green Banking for Sustainability and Responsible Environmental Protection
    Green banking is a way of banking that aims for sustainability and responsibility in order to protect the environment.  ( 4 min )
    Benefits of Data Governance
    Data governance is the process of managing the data’s usability, security, availability, and quality within an organization using internally set and enforced rules and policies. Data governance is a must for any organization that seeks to use its data for analysis. It creates an environment where data can thrive as a source of useful insight…  ( 3 min )
    Smart Cities of the Future- Powered by IoT
    A smart city is an urban zone which uses different types of electronic systems and sensors to collect data. The smart city concept combines information and communication technology with physical devices linked to Internet of Things networks to enhance the efficiency of city operations and services. The smart city concept is gaining important market…  ( 2 min )
    Is Detailed Design Anti-Agile?
    In the Beginning . . . In days of yore, systems development projects were front-ended with laborious requirements engineering and design tasks. This made sense then because development was labor-intensive, time-consuming, and expensive. Changes to the scope or design of a solution mid-development increased the likelihood of errors and incremental time and expense. In recognition…  ( 9 min )
    5 data trends business leaders should anticipate in 2022
    Not surprisingly, digital transformation is a prerequisite for forward-thinking businesses. The catastrophic disruption of the global pandemic did not slow down the need for systems, processes, and people who will help modern organizations move faster. Data, as always, is top of mind. With so many trends and tools available, it can be hard to see…  ( 6 min )
  • Open

    Does anyone have a good code example of a MLP Neural Network with Momentum implemented?
    I have been asked to write MLP neural network code from scratch, with a learning rate and momentum implemented. submitted by /u/TobiasFred [link] [comments]  ( 1 min )
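    Since the request is literally for a code example, here is a minimal from-scratch sketch (toy XOR data; layer sizes and hyperparameters are arbitrary) using the classic momentum update v = mu*v - lr*grad, w = w + v:

        import numpy as np

        rng = np.random.default_rng(0)

        # toy XOR data
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([[0], [1], [1], [0]], dtype=float)

        # 2-16-1 MLP with sigmoid activations
        W1 = rng.normal(0, 1, (2, 16)); b1 = np.zeros(16)
        W2 = rng.normal(0, 1, (16, 1)); b2 = np.zeros(1)

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        lr, mu = 0.5, 0.9
        # one momentum buffer per parameter
        vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
        vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)

        for epoch in range(5000):
            # forward pass
            h = sigmoid(X @ W1 + b1)
            out = sigmoid(h @ W2 + b2)
            # backward pass (MSE loss)
            d_out = (out - y) * out * (1 - out)
            dW2 = h.T @ d_out; db2 = d_out.sum(0)
            d_h = (d_out @ W2.T) * h * (1 - h)
            dW1 = X.T @ d_h; db1 = d_h.sum(0)
            # momentum update: v <- mu*v - lr*grad, then w <- w + v
            vW1 = mu * vW1 - lr * dW1; W1 += vW1
            vb1 = mu * vb1 - lr * db1; b1 += vb1
            vW2 = mu * vW2 - lr * dW2; W2 += vW2
            vb2 = mu * vb2 - lr * db2; b2 += vb2

        print(out.round(3))  # should approach [0, 1, 1, 0]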
  • Open

    Artificial intelligence system learns concepts shared across video, audio, and text
    A machine-learning model can identify the action in a video clip and label it, without the help of humans.  ( 6 min )

  • Open

    [D] Any tips for handling sparse, discrete, temporal data?
    Hi, I'm wondering if anyone could help familiarize me with modern feature engineering methods for handling this type of data in the context of machine learning. Background: the data I'm talking about is hospital/facility utilization data for patients. Below is a table similar to the data I have for each patient. In this example we can see that this single patient went to the ED on the 5th day, was admitted into the hospital on the 5th day, and was admitted to a SNF (Skilled Nursing Facility - basically a nursing home) on the 8th day:

                ED  Hospital  SNF
        Day 1    0         0    0
        Day 2    0         0    0
        Day 3    0         0    0
        Day 4    0         0    0
        Day 5    1         1    0
        Day 6    0         0    0
        Day 7    0         0    0
        Day 8    0         0    1
        Day 9    0         0    0

    In reality, we have more types of facilities as columns and years' worth of days. Problem…  ( 2 min )
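    One common direction (a sketch using the toy frame above; the window length and fill value are arbitrary) is to collapse each sparse 0/1 column into dense count and recency features per day:

        import pandas as pd

        # toy frame matching the table above (hypothetical columns)
        df = pd.DataFrame({"ED":       [0, 0, 0, 0, 1, 0, 0, 0, 0],
                           "Hospital": [0, 0, 0, 0, 1, 0, 0, 0, 0],
                           "SNF":      [0, 0, 0, 0, 0, 0, 0, 1, 0]})

        feats = pd.DataFrame(index=df.index)
        day = df.index.to_series()
        for col in df.columns:
            # trailing-window event counts compress long sparse histories
            feats[f"{col}_cnt_7d"] = df[col].rolling(7, min_periods=1).sum()
            # recency: days since the most recent event (large value if none yet)
            last = day.where(df[col] == 1).ffill()
            feats[f"{col}_days_since"] = (day - last).fillna(10**6)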
    [P] Deploying my MIDI generator
    Not sure if this is the best place for this, as I am not a Machine Learning engineer. I have developed a web application that lets you generate and remix MIDI with NLP Transformers. Right now it runs on Flask. The model is fairly small as far as transformers go (~500 MB) and doesn't use very much memory when running. It is only doing inference, so it runs for about 5-15s while processing a job; containerized, it uses less than 4 GB of memory at all times. I am trying to figure out how to deploy my Flask server to EC2 without incurring massive compute costs. It seems like all of the instances with GPU or Spot GPU support are prohibitively expensive and provide way more memory and local storage than I would need. Have any of you encountered a similar problem? Are there any instances that you suggest? Project here: https://github.com/pickles976/chiptune-ai submitted by /u/EuphoricFedoria [link] [comments]  ( 1 min )
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
    Diffusion models (DMs) have a more stable training phase than GANs and fewer parameters than autoregressive models, yet they are just really resource intensive. The most powerful DMs require up to 1,000 V100-days to train (that’s a lot of $$$ for compute) and about a day per 1000 inference samples. The authors of Latent Diffusion Models (LDMs) pinpoint this problem to the high dimensionality of the pixel space, in which the diffusion process occurs, and propose to perform it in a more compact latent space instead. In short, they achieve this feat by pretraining an autoencoder model that learns an efficient compact latent space that is perceptually equivalent to the pixel space. A DM sandwiched between the convolutional encoder-decoder is then trained inside the latent space in a more computationally-efficient way. In other words, this is a VQGAN with a DM instead of a transformer (and without a discriminator). As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/293 Blog post: https://www.casualganpapers.com/high-res-faster-diffusion-democratizing-diffusion/Latent-Disffusion-Models-explained.html Latent Diffusion Models arxiv / code Join the discord community and follow on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [D] Training set size of SOTA models?
    Hi, I'm trying to get an idea of just how large the datasets for big ML models, such as DALL-E, are. Does anyone have some references? GPT-3 is apparently trained on 45TB of data: https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai How large are other big datasets? submitted by /u/rk3000 [link] [comments]
    [D] How to train large CLIP-style models?
    I'd like to train a medium-large contrastive model similar to CLIP. Something that will fit onto a single GPU but for which I can only compute a few batch elements at a time. The issue is that these types of models rely on other batch elements for the "contrastive" part of the equation. Bigger batch sizes lead to faster (and better) convergence. Is anyone aware of any papers (or have any hands-on tips) that address this issue? submitted by /u/neonbjb [link] [comments]  ( 1 min )
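    One workaround is a MoCo-style queue of stop-gradient negatives, which decouples the number of negatives from the micro-batch that fits in memory (a hedged sketch; all names and sizes are made up). For an exact alternative, GradCache-style gradient caching reconstructs the true large-batch contrastive gradient in two passes:

        import torch
        import torch.nn.functional as F

        K, D = 8192, 256                                 # queue length, embedding dim
        queue = F.normalize(torch.randn(K, D), dim=1)    # negatives from past steps

        def info_nce_step(img_emb, txt_emb, tau=0.07):
            # img_emb/txt_emb: (B, D) encoder outputs for one small micro-batch
            img = F.normalize(img_emb, dim=1)
            txt = F.normalize(txt_emb, dim=1)
            pos = (img * txt).sum(dim=1, keepdim=True)   # (B, 1) matched pairs
            neg = img @ queue.t()                        # (B, K) queued negatives, no grad
            logits = torch.cat([pos, neg], dim=1) / tau
            labels = torch.zeros(img.size(0), dtype=torch.long)  # positive is column 0
            return F.cross_entropy(logits, labels)

        def enqueue(txt_emb):
            global queue                                 # FIFO: add newest, drop oldest
            queue = torch.cat([F.normalize(txt_emb.detach(), dim=1), queue])[:K]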
    [D] Focus on the Process: Formulating AI Ethics Principles More Responsibly
    Hi there, there is a new Gradient article some of you may find interesting: Focus on the Process: Formulating AI Ethics Principles More Responsibly Here's a preview of what it's about: It is tempting to respond to the present state in AI ethics by abandoning searches for principles. Given that there are so many principles out there already and so few tools to operationalize them, organizations might be inclined to simply use some of the existing principles and focus their attention on operationalization. However, a difficult question to answer is which principles to use. How do we know that organizations will choose well? What is to prevent them from cherry picking, for example? One way to go is to sift through the existing literature, looking for universal AI ethics principles. The hope might be that if we find universal principles, they could guide the development and evaluation of AI systems everywhere. Organizations that develop AI systems could focus on operationalizing them. Those who evaluate AI systems, such as investors, regulators, auditors, and consumers, could examine AI systems based on these principles. I advise against this approach. In this article, I explain why it is unlikely that universal AI ethics principles will be found and I discuss reasons to avoid using dominant trends as default. Instead, I suggest that each organization should articulate its own AI ethics principles, and I sketch ways to do so responsibly. Enjoy! submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    [D] Are multilayer perceptrons reversible?
    I have a multilayer perceptron that maps input X to output Y. Three hidden layers with tanh activations, plus a final layer with a linear (identity) activation producing the output Y. Can I reverse the network such that it would map Y to X? submitted by /u/boleslev [link] [comments]  ( 1 min )
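    Only under restrictive conditions. tanh is invertible on its range, but the affine maps are invertible only if each weight matrix is square and full-rank; otherwise you can at best compute a least-squares preimage. A hedged sketch (weights as a list of (out, in) matrices, biases omitted from the description above):

        import numpy as np

        def invert_mlp(Ws, bs, y, eps=1e-6):
            # walk the layers backwards: undo each affine map, then undo tanh
            h = np.asarray(y, dtype=float)
            for i, (W, b) in enumerate(zip(reversed(Ws), reversed(bs))):
                # least-squares solve of W x + b = h (pseudo-inverse if not square)
                h = np.linalg.lstsq(W, h - b, rcond=None)[0]
                # every layer except the original input layer was followed by tanh
                if i < len(Ws) - 1:
                    h = np.arctanh(np.clip(h, -1 + eps, 1 - eps))
            return h

    Note the clipping: an activation at exactly ±1 corresponds to an infinite pre-activation, and outputs outside the network's reachable range have no preimage at all, so in general the map Y → X exists only locally, not globally.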
    [D] Anyone working on Explainable AI?
    I am currently working at ML@Uber. Is anyone here dealing with the problem of explainable AI, i.e., how are you able to understand and interpret predictions made by your machine learning models? Anyone here facing this problem or who has already solved it? submitted by /u/gauravapiscean [link] [comments]  ( 1 min )
    [D] GCP compute engine pricing question
    I'm currently doing a little personal project which involves training a VQVAE (basically a conv NN and a variational autoencoder). I've got it all up and running after a few weeks of figuring things out, and I'm finding that I'm being rinsed on costs. My basic setup is a c2-standard-4 VM (4 vCPUs and 16GB RAM) with a 100GB disk mounted; I'm also spinning down the VM when not in use, so I don't get a discount for continual usage. I started with an n1-standard-1 but found myself quickly running out of RAM, which still happens if my batch size is too high, but less frequently. The disk is necessary as I'm dealing with large datasets, but it doesn't seem to be costing me much. As of right now my costs are about $4.5 per day, which obviously isn't that high, but for a personal project the costs soon rack up if I'm running this for days at a time every week. On my current setup I can run a batch size of 8 (any higher and I run out of memory), and an epoch takes about 5 hours. The point I'm struggling with at the moment is whether I can better optimise CPU count vs runtime, i.e., is it worth swapping to slower/fewer but cheaper CPUs, or will it cost me just as much because they run longer? The other thing I'm not sure about is whether I can gain much by using other platforms; my current setup isn't really that tied into the GCP ecosystem, so I can switch fairly easily. I realise my questions are fairly general and everyone's setups are different, so I'm just looking for a point in the right direction if nothing else. Thanks! Edit: also, if there's anywhere that can give me a decent amount of free hours, I'd be interested in that too! submitted by /u/Batteredcode [link] [comments]  ( 3 min )
    [D] Client-Side Caching Improves Feature Store Performance by 70% at DoorDash
    To enable our platform to support hundreds of data-driven models and produce billions of model predictions, we built a robust ML platform, feature store, and prediction engine. This was only the beginning, as the feature store at the heart of the platform utilized multiple TBs of memory in large Redis clusters, which needed to be optimized for cost and fast loading times for the optimal customer experience. To improve feature store performance we implemented a caching layer, but we still needed to choose the best caching library, implement the solution, and analyze the platform to set up experiments that would validate the new approach. I wanted to share this journey with the developer community so they can learn from my experience and how I was able to improve feature store performance by 70% at DoorDash. Please check out the article and let me know your thoughts on my approach: https://doordash.engineering/2022/05/03/how-we-applied-client-side-caching/ submitted by /u/csko7 [link] [comments]  ( 1 min )
    Twitter algorithm already open source?[D]
    I've been seeing some people talking about how Elon's going to make Twitter's algorithm open source, but isn't it already publicly available? At least the Who To Follow (WTF) service is, so I don't get why people are losing their minds over this. submitted by /u/lehmanmafia [link] [comments]  ( 2 min )
    [R] [D] SERF activation function - improving Swish
    https://arxiv.org/abs/2108.09598 Paper looks very promising, what do you think? Anyone tried SERF yet? submitted by /u/Shronnin [link] [comments]  ( 1 min )
    Sound clip Recommendation Model [D]
    Hello everyone! So I am starting a project where I need to build a model that can identify segments of audio that indicate “problems with call center agents”. I have around 65,000 pieces of audio, all of which have had segments labeled. I also have the transcribed text and the exact times where the audio was segmented. I am thinking of starting with Data2Vec. I would like to hear your opinions on how you would approach this. Thanks in advance! submitted by /u/JS-AI [link] [comments]  ( 1 min )
    [D] Why are graph neural networks applied to non-graph structured data?
    Recently I have come across several papers that use graph neural networks for processing images, audio and video data. I can understand why GNNs are needed for data that have a graph structure (like molecules or traffic network etc). But for images and audio, what is the difference between using GNNs versus normal DNN? submitted by /u/Far_Conversation_445 [link] [comments]  ( 2 min )
    [D] Would an MLOps platform be useful?
    Hey folks, I was wondering, would an MLOps platform be something the community would embrace? A tool where, from a CLI or GUI, you can easily deploy models to a production environment, or expose them as microservices? Something that solves the infrastructure bits and configs is what I'm wondering about. Or do people prefer to simply handle their own infra or manage on-premises machines? submitted by /u/zedrakk [link] [comments]  ( 1 min )
    [D] Handling Missing Data or Encoding/Scaling. What comes first?
    Just out of curiosity, I just want to know whether you handle missing data before scaling/encoding of the dataset or after it. Thank you, submitted by /u/trj_flash75 [link] [comments]  ( 1 min )
    [R] Meta is releasing a 175B parameter language model
    submitted by /u/StellaAthena [link] [comments]  ( 2 min )
  • Open

    Is it correct to say that a nn with n layers is an nth order approximation?
    submitted by /u/HasFiveVowels [link] [comments]
    Stanford has trained AI to classify proteins
    submitted by /u/aidev2040 [link] [comments]
    Hi, im looking to build a neural network from scratch for a recommender system, that attempts to recommend anime programmes, could someone point me to relevant papers/books/resources?
    Hi, So for my final year project, I decided that im going to code a neural network from scratch for a recommender system, however I dont know the first thing about this topic so im going to be starting from the ground up. Could anyone point me in the right direction and recommend some resources etc? thanks! submitted by /u/tvvvs [link] [comments]  ( 1 min )
    CNN ERROR Inconsistent number of samples
    I am building a CNN to detect DGA domains in DNS traffic, but I have encountered this problem and just don't know how to fix it. [screenshot of the error omitted] submitted by /u/Dangerous_Intern_510 [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
  • Open

    Strange Galaxies (A.I animation + sound design)
    submitted by /u/nenomancer [link] [comments]
    1914 - People engaged in winter sports activities near Wetzlar, Hesse, Germany [Colored using AI]
    submitted by /u/pheonix_bird [link] [comments]
    Common Voice Has A New Dataset
    submitted by /u/limapedro [link] [comments]
    7 Best Natural Language Processing Courses (2022) | Best NLP Courses
    submitted by /u/maneesh123456 [link] [comments]
    Robots still not around us?… — mathematicians to blame
    submitted by /u/marvelmind_robotics [link] [comments]  ( 1 min )
    A Look at Machine Learning
    submitted by /u/lutipri [link] [comments]
    Generating alternate Spongebob theme songs with OpenAI Jukebox
    submitted by /u/PF-Wang [link] [comments]
    AI fitness app
    Hi guys, I recently came across a new fitness app that uses artificial intelligence; it looks interesting and promising. Do you think AI can replace real trainers? Here's the link: QR Kinestex (google.com) submitted by /u/vovayoung [link] [comments]
    Zama Open-Sources Concrete ML v0.2 To Support Data Scientists Without Any Prior Cryptography Knowledge To Automatically Turn Classical Machine Learning (ML) Models Into Their FHE Equivalent
    Zama is a Paris-based startup that aims to bring end-to-end encryption to AI by enabling developers to use Python to create models that run on encrypted data. In late April, the Zama Team released the public alpha of Concrete ML, a package developed on top of Concrete Numpy. This release gives data scientists with no prior knowledge of cryptography simple APIs for automatically converting traditional machine learning (ML) models into their FHE equivalents. One of the main goals of this version is to make using Concrete ML as convenient as possible for users of popular machine learning frameworks. Model training for linear models and trees is not reimplemented in Concrete ML, allowing researchers to utilize the many variations and features of these models that the scikit-learn package supports. The team conducted a series of experiments to compare the models between scikit-learn and Concrete ML. When tested on simple 2D linear models, FHE performance was comparable to that of its unencrypted scikit-learn equivalents. However, as the number of dimensions increases in the current release, the performance of strongly quantized classifiers rapidly falls, which will be refined in future releases. On encrypted data, tree-based classifiers utilizing Concrete ML show outstanding accuracy. Running tree models that demand heavy comparisons is easily enabled thanks to Zama’s unique approach to FHE, which provides Programmable Bootstrapping. As a result, tree-based models perform as well as their scikit-learn/xgboost counterparts in FHE. This remains true even for datasets with many dimensions, and tree-based models are typically the most performant when dealing with tabular data. Data scientists can implement Decision Trees, Random Forests, and Gradient Boosted Trees after the team publishes specific deployment APIs. Continue reading. Github: https://github.com/zama-ai/concrete-ml submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI painting Marvel superheroes
    submitted by /u/notrealAI [link] [comments]  ( 1 min )
  • Open

    Alpa: Automated Model-Parallel Deep Learning
    Posted by Zhuohan Li, Student Researcher, Google Research, and Yu Emma Wang, Senior Software Engineer, Google Core Over the last several years, the rapidly growing size of deep learning models has quickly exceeded the memory capacity of single accelerators. Earlier models like BERT (with a parameter size of < 1GB) can efficiently scale across accelerators by leveraging data parallelism in which model weights are duplicated across accelerators while only partitioning and distributing the training data. However, recent large models like GPT-3 (with a parameter size of 175GB) can only scale using model parallel training, where a single model is partitioned across different devices. While model parallelism strategies make it possible to train large models, they are more complex in that th…  ( 8 min )
  • Open

    The Scientist of the Scientist
    Science has been the most important tool of humanity even before the dawn of history. Humans have used science, without even knowing that…  ( 5 min )
  • Open

    Quartal melody: Star Trek fanfare
    Intervals of a fourth, such as the interval from C to F, are common in western music, but consecutive intervals of this size are not. Quartal harmony is based on intervals of fourths, and quartal melodies use a lot of fourths, particularly consecutive fourths. Maybe the most famous quartal melody is the opening fanfare to […]  ( 2 min )
  • Open

    What are some top venues for the submission of a Reinforcement learning related paper?
    submitted by /u/blitzkreig3 [link] [comments]  ( 1 min )
    Explanation of Optimal Policies and Value Functions | Dynamic Programming
    submitted by /u/shani_786 [link] [comments]
  • Open

    Rethinking Human-in-the-Loop for Artificial Augmented Intelligence
    How do we build and evaluate an AI system for real-world applications? In most AI research, the evaluation of AI methods involves a training-validation-testing process. The experiments usually stop when the models have good testing performance on the reported datasets because real-world data distribution is assumed to be modeled by the validation and testing data. However, real-world applications are usually more complicated than a single training-validation-testing process. The biggest difference is the ever-changing data. For example, wildlife datasets change in class composition all the time because of animal invasion, re-introduction, re-colonization, and seasonal animal movements. A model trained, validated, and tested on existing datasets can easily be broken when newly collected dat…  ( 5 min )
  • Open

    ‘In the NVIDIA Studio’ Welcomes Concept Designer Yangtian Li
    This week In the NVIDIA Studio, we welcome Yangtian Li, a senior concept artist at Singularity6. Li is a concept designer and illustrator who has worked on some of the biggest video game franchises, including Call of Duty, Magic: the Gathering and Vainglory. Her artwork also appears in book illustrations and magazines.  ( 3 min )

  • Open

    [D] How to do unsupervised anomaly detection in a principled way?
    I have terabytes of tabular and image data. I want to determine which data points are novel/anomalous. I have a couple of anchor data points for entities that are known, plus features about said entities. Effectively I want to find things that are novel relative to my anchors, without any labels for whether something is actually interesting rather than noise. Here's an analogy: I have driving footage of a reference driver who is "perfectly" driving. I also have all of the sensor data: accelerometers, where the human is looking, etc. I simulate 1000s of self-driving cars that are running their own agent models. I get data on all of these as well. I deploy an agent to the real world that is expected to do better than or comparable to the reference driver while having different model parameters. I don't know how this driving agent will do in the real world. I want to choose agent instantiations that are different from the rest in some regard (anomalous). I'm constrained by how many deployments I can make; it's expensive and dangerous. An issue I face with this is that there are dozens of models that will give me anomalous points, but sometimes they don't overlap at all. There's no consistency. Ideally I want to find data points that are novel in feature space (far away?) but expected to be functional... on non-labeled data. So yeah... after writing this it kind of sounds like a screwed situation, but maybe someone here has an idea of, or experience with, how to build a system that can enable this kind of parameter selection engine (suggesting novel parameters that are functional, without any labeled data). submitted by /u/memproc [link] [comments]  ( 1 min )
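    Given the inconsistency across detectors, one pragmatic sketch is to score two things separately and rank on both: distance from the anchors (novelty relative to known-good) and an unsupervised outlier score (noise risk). All names below are hypothetical:

        import numpy as np
        from sklearn.ensemble import IsolationForest
        from sklearn.metrics import pairwise_distances
        from sklearn.preprocessing import StandardScaler

        # candidates: rows of agent features; anchors: the known-good references
        def novelty_scores(candidates, anchors):
            scaler = StandardScaler().fit(anchors)
            C, A = scaler.transform(candidates), scaler.transform(anchors)
            # "how far from known-good behaviour": distance to nearest anchor
            d_anchor = pairwise_distances(C, A).min(axis=1)
            # "how unusual among all candidates": isolation forest score
            iso = IsolationForest(random_state=0).fit(C)
            return d_anchor, -iso.score_samples(C)  # higher = more anomalous

    Candidates that are far from the anchors but not extreme outliers among their peers are plausibly "novel yet functional"; the two axes separate novelty from likely noise.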
    [P] Transfer Learning with BERT, number of examples.
    I'm looking to make a review classifier (3-class classification). It sounds like transfer learning is standard these days. But how many examples would generally be needed to improve on the out-of-the-box result? Would 1,000 be sufficient, or would it realistically need to be in the tens of thousands? Or is this a "how long is a piece of string" question? edit: sorry, wrong tag submitted by /u/mldude8 [link] [comments]  ( 1 min )
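    For rough scale: with a pretrained encoder, a few hundred to a few thousand labeled reviews often already gives a usable fine-tuned baseline, with diminishing returns well before tens of thousands, though it depends on class balance and domain. A minimal fine-tuning sketch (reviews.csv is a hypothetical file with "text" and "label" columns):

        from datasets import load_dataset
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=3)   # 3-class review head

        # hypothetical CSV with "text" and integer "label" (0/1/2) columns
        ds = load_dataset("csv", data_files={"train": "reviews.csv"})["train"]
        ds = ds.map(lambda b: tok(b["text"], truncation=True,
                                  padding="max_length", max_length=128),
                    batched=True)
        ds = ds.train_test_split(test_size=0.1)

        args = TrainingArguments("out", num_train_epochs=3,
                                 per_device_train_batch_size=16)
        Trainer(model=model, args=args, train_dataset=ds["train"],
                eval_dataset=ds["test"]).train()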
    [D] Presenting ML classification model results to non-technical users
    I want to include machine learning classification (string target) results in a web platform. What information can I show that will be meaningful? I can already think of the predicted value, accuracy, and probability. What else can I include - graphics or anything useful - to present my results to non-technical people (clients)? submitted by /u/According-Promise-23 [link] [comments]  ( 1 min )
    [D] How do some people publish so much in this field?
    I'm not talking about PIs necessarily, but sometimes I search up other grad students in my department or people who email me and some of them have 10+ first author papers a year? I don't really understand how this is possible; I don't think I'm the best implementer out there but it would take me at least 2-3 months to read the literature, come up with a new research question, run experiments, iterate and write up. I mostly work with just me and my advisor though, maybe that's the difference? If you're one of these people, are you just hyper-productive? Already supervising students? Or collaborating with a ton of people and not doing a high proportion of work on any project? submitted by /u/IPvIV [link] [comments]  ( 5 min )
    [P] Pretraining dense retrievers with masked language model objective(REALM)
    Hi, I made a video explaining REALM. It is a pretraining method for dense retrievers. It uses a language model along with a retriever for pretraining. Given a random masked sentence like "Each angle in an equilateral triangle is [MASK]", the retriever gets top passages that might contain information about equilateral triangles. The passages are then passed to a language model to predict the value for each "[MASK]" token. Using this MLM objective, as model performance improves so does the quality of retrieval. A simple and effective idea for pretraining. This is the final video of our series on Open-domain question answering using dense retrievers. I will appreciate any feedback. Thanks for the support till now. https://www.youtube.com/watch?v=aQcoI1t6HOs submitted by /u/infiniteakashe [link] [comments]  ( 1 min )
    [D] Where did MT-NLG go wrong with their scaling experiments, comparing its capabilities to PaLM?
    The MT-NLG model was 530B parameters compared to PaLM's 540B. They seem to have done things correctly from what I skimmed; however, their model is neither that impressive on benchmarks nor does it demonstrate any special capabilities. So what was the reason MT-NLG didn't work as well as expected? Is it possible it has abilities to explain jokes (on par with PaLM) that were undiscovered by the authors? Or are there any gaping flaws in how they scaled the different hyperparameters (heads, layers, dims, etc.)? Perhaps such an analysis has already been done, but I would love to see what you guys think about why it underperformed... In such an unknown area as this, it seems that unless one scales models with multiple attempts, it's hard to accurately judge when we have reached the point where scaling laws fall off. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [R] How to access CVPR 21 workshop extended abstracts?
    Hi everyone, I intend to submit a short paper to a 2022 CVPR workshop. I wanted to get inspiration from other CVPR short abstracts to see how they are written, their structure and so on. However, I'm really struggling a lot to find a place on the internet where I can download workshop extended abstracts from. Does anyone know where one can download the 2021 CVPR workshop short papers? submitted by /u/BigDataOverflow [link] [comments]  ( 1 min )
    [D] Train model to predict continuous variable ranging from 0 to 1 using images as input.
    I have thousands of images of structures that I am trying to use to train a NN to predict their parameters. The inputs are the 2D images of the structure, and the output is the one parameter I am trying to predict (I actually have two parameters to predict, but one is fine too). I have tried using the dimensions of the structure to predict the parameters, but the error was too high. What do you recommend I do for this? Links are appreciated submitted by /u/ftority [link] [comments]  ( 1 min )
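    A common baseline here (a sketch, not a recommendation specific to these structures): a standard CNN backbone with a sigmoid regression head, so the output is constrained to [0, 1], trained with MSE or L1 loss. For two parameters, widen the head's output to 2.

        import torch
        import torch.nn as nn
        import torchvision.models as models

        # hypothetical setup: regress one parameter in [0, 1] from 2D images
        backbone = models.resnet18(weights=None)   # older torchvision: pretrained=False
        backbone.fc = nn.Sequential(nn.Linear(backbone.fc.in_features, 1),
                                    nn.Sigmoid())  # squashes output into (0, 1)

        loss_fn = nn.MSELoss()                     # or nn.L1Loss() if outliers dominate
        opt = torch.optim.Adam(backbone.parameters(), lr=1e-4)

        def train_step(images, targets):
            # images: (B, 3, H, W) float tensor; targets: (B, 1) values in [0, 1]
            opt.zero_grad()
            loss = loss_fn(backbone(images), targets)
            loss.backward()
            opt.step()
            return loss.item()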
    [D] How do you test (unit, integration) your Machine Learning models/pipelines?
    First things first, do you test them at all? Practices can differ across companies. Secondly, I believe the testing process can differ based on the model use case (CV, NLP, ...), but is there any unified set of recommendations and good practices? Can general unit and integration testing approaches be applied to ML models/pipelines? What is your approach to this? If you feel like writing, maybe you could cover in the comments:
    - The area of ML you are talking about: traditional ML algorithms, CV models, NLP models.
    - What are you testing? The validity of the forward pass (example: unit testing the shapes)? The validity of the training loop (somehow)?
    - Are you testing whether your model's performance is above some threshold (example: accuracy, F1, BLEU, maybe execution time)?
    - Are you testing whether your model makes correct predictions on some crucial cases?
    - Are you testing important properties of your dataset (example: that it's properly standardized)?
    Some of these things can be tested pretty easily and don't require automated tests, but it is nice to have them. Do you have some other ways to do sanity checks? Please feel free to add other aspects of ML models/pipelines that are worth testing. Looking forward to your insights! submitted by /u/Icy_Fisherman7187 [link] [comments]  ( 2 min )
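    Two unit tests that cover the forward-pass and training-loop checks above, as a pytest sketch (build_model is a hypothetical factory for your own model):

        import torch
        from mymodel import build_model  # hypothetical factory under test

        def test_forward_shapes():
            # a cheap guard against broken layer wiring
            model = build_model(num_classes=10)
            out = model(torch.randn(4, 3, 32, 32))
            assert out.shape == (4, 10)

        def test_can_overfit_one_batch():
            # a training loop that cannot drive loss down on 4 samples is broken
            model = build_model(num_classes=10)
            x, y = torch.randn(4, 3, 32, 32), torch.tensor([0, 1, 2, 3])
            opt = torch.optim.Adam(model.parameters(), lr=1e-2)
            for _ in range(200):
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(model(x), y)
                loss.backward()
                opt.step()
            assert loss.item() < 0.1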
    [D] What program/software can make animations like this?
    Hi, looking for advice for software/programs that can make animations like this one - thanks! https://upload.wikimedia.org/wikipedia/commons/transcoded/9/92/Infinitely_wide_neural_network.webm/Infinitely_wide_neural_network.webm.360p.vp9.webm submitted by /u/unital [link] [comments]
    [R] A very preliminary analysis of DALL-E 2
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
    [P] The easiest way to process and tag video data
    submitted by /u/happybirthday290 [link] [comments]  ( 3 min )
    [D] How to effectively sample from a high-dimensional space and create data-efficient training
    So I have been working on generating data and creating a neural network to predict deformation in meshes, given some mesh parameters like thickness, elasticity, point of force, etc. What I know for sure is that the parameters I am creating a dataset for lie (or will lie) within known ranges, and some meshes will have the same deformation for different combinations of these parameters. What I have tried is uniformly sampling each of these parameters, but this seems very data-inefficient because there may be blind spots in the combined sampling - say, the deformation for a particular combination of parameters can be different and unseen. My question is: how do you efficiently sample from a large multi-dimensional space, and is there a better training method that would somehow inform another network to sample efficiently? I will be happy to explain more if this is somehow unclear. submitted by /u/bitemenow999 [link] [comments]  ( 1 min )
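    For space-filling coverage without the blind spots of independent uniform sampling, Latin hypercube (or Sobol) sampling is the standard first step; beyond that, active learning - asking the current network where its predictions are most uncertain and sampling there - is the usual way to let training inform sampling. A sketch of the first part (the parameter bounds are made up):

        from scipy.stats import qmc

        # hypothetical ranges: thickness, elasticity, force point x, force point y
        l_bounds = [0.5, 1e3, 0.0, 0.0]
        u_bounds = [5.0, 1e5, 1.0, 1.0]

        sampler = qmc.LatinHypercube(d=4, seed=0)
        unit = sampler.random(n=2000)                   # stratified points in [0, 1]^4
        params = qmc.scale(unit, l_bounds, u_bounds)    # mapped to physical ranges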
    [D] Multiple PDF Files Similarity
    Hi everyone, I am developing an application in which users upload multiple PDF files, and the application groups those PDFs according to a similarity percentage entered by the user. If the user enters 80%, documents with at least 80% similarity are grouped into batch 1, and so on (similar PDF files are grouped into batches, i.e., batch 1, batch 2, ...). The PDF documents can be text, images, or a combination of both. I am using .NET Core and Angular. I tried the k-means algorithm, but the results differ every time because the algorithm picks random files as initial centroids. How can I implement this? submitted by /u/ConsciousSlice4329 [link] [comments]  ( 1 min )
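    One deterministic alternative (a sketch; it assumes text has already been extracted from each PDF, so image-only PDFs would need OCR first): agglomerative clustering with a cosine-distance threshold, which has no random initialization and maps directly onto the user's similarity percentage:

        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import pdist
        from sklearn.feature_extraction.text import TfidfVectorizer

        def group_documents(texts, min_similarity=0.8):
            # cosine distance threshold corresponds to 1 - required similarity
            X = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
            Z = linkage(pdist(X, metric="cosine"), method="average")
            # deterministic: the same inputs always yield the same batches
            return fcluster(Z, t=1.0 - min_similarity, criterion="distance")

        batches = group_documents(["first doc text", "second doc text", "another one"])
        # batches[i] is the 1-based batch id for texts[i]

    A .NET application could call this as a small Python service, or port the same TF-IDF-plus-hierarchical-clustering idea to an equivalent .NET library.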
  • Open

    Robert Magno (Run:AI) - Building the Best AI Infrastructure Stack to Accelerate Your Data Science
    submitted by /u/Dracutela [link] [comments]
    Weekly China AI News: Beijing Issues First-Ever Driverless Robotaxi Permits; Huawei Anticipates AI Compute to Jump 500 Times by 2030; CogView2 Challenges DALL-E-2 With Better Results
    submitted by /u/trcytony [link] [comments]  ( 1 min )
    How many years away do you think we are from AI that can turn our old games into something that looks like UE5?
    To me it seems to make sense that we would begin relying more and more on AI to improve our old games. For example, FFXIV won't be getting ray tracing anytime soon, but I imagine it's only a matter of time before I could download some AI-based image software (like DALL-E) that lets me tweak my gameplay experience to look much more realistic. (1) Do you think I'm right in assuming this is the likely trajectory of AI's use in game graphics? (2) And if the answer to 1 is yes, how far away do you think such technology is? Thank you! submitted by /u/solidwhetstone [link] [comments]  ( 1 min )
    Do you think that AI will ever be as smart as humans (AGI)? If you do think so, when or around what time period?
    Hey guys. I'm working on a project for school and would like to hear your opinions, as the leading AI subreddit, on whether or not AGI will be achieved, and if so, when. I'd greatly appreciate your input! submitted by /u/SurroundSwimming3494 [link] [comments]  ( 3 min )
    Last Week in AI: AI helps model volcanoes, Anthropic gets $580M for more explainable AI, AI algorithms that screen for child neglect, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
What's the Involvement of AI in Transforming Media & Entertainment?
    Artificial intelligence (AI) will continue to disrupt the media sector, just as it did in 2020 and 2021. AI will most likely fulfill the three critical roles of recommendation, speech recognition, and media automation in this market. Read more: Artificial Intelligence will Continue to Transform the M&E Landscape submitted by /u/JencyJane [link] [comments]  ( 1 min )
    Artificial Intelligence Implications: The Future of Modern Wargaming! (Would love some feedback on a University blog post surrounding Wargaming and the use of AI!)
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
5 Best Machine Learning Courses for Beginners and Advanced Learners in 2022
    submitted by /u/maneesh123456 [link] [comments]
How to deepfake audio where you change just the voice but keep what’s been said.
    For example: The scene from Taxi Driver where the protagonist speaks with himself in front of the mirror, how to keep what he’s saying but change his voice for Morgan Freeman’s voice. submitted by /u/Accomplished-Door-61 [link] [comments]  ( 1 min )
    Web Scraping with Python - Learning the Basics | Rubik's Code
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Spoofing detector using YoloV4 Tiny 3L
    submitted by /u/Gloomy_Recognition_4 [link] [comments]  ( 1 min )
    AI2 Open-Sources ‘LM-Debugger’: An Interactive Tool For Inspection And Intervention In Transformer-Based Language Models
In natural language processing, a language model is a probabilistic statistical model that calculates the likelihood of a specific sequence of words appearing in a phrase based on the preceding words. As a result, it’s common in predictive text input systems, speech recognition, machine translation, and spelling correction, among other applications. Language models are a method of converting qualitative text information into quantitative data that machines can interpret. Modern NLP models rely on transformer-based language models (LMs). However, much more research is needed on their underlying prediction process. Unclear prediction behavior is an obstacle both for end-users, who don’t understand why a model generates certain predictions, and for developers, who want to diagnose or fix model behavior. A new paper by a group of researchers from the Allen Institute for AI, Tel Aviv University, Bar-Ilan University, and the Hebrew University of Jerusalem introduces LM-Debugger, an interactive open-source tool for fine-grained interpretation of and intervention in LM predictions. This work will increase the transparency of LMs. Continue Reading Paper: https://arxiv.org/abs/2204.12130 Github: https://github.com/mega002/lm-debugger submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    A new Trading AI?
Hello everybody. I have this idea for a trading bot but realised it's more in the realm of AI. I'm highly uneducated in this field and don't know what I'm talking about, so if possible could someone please answer a few questions. I have found a super successful day trader that I've been learning from for a couple of months. He's supposedly "reverse engineered the markets". I've drawn the trend lines and done the technical analysis he's taught me and it's scarily accurate. My only problem is day trading takes time. So if I could have a bot understand his "science" and run it while I'm not there, it would be a free money hack lol. Q1: How hard is it, on a scale of 1-10, to develop/create a trading AI? Q2: Could someone estimate what a project like this would cost me? Q3: Has a trading AI ever been created? I really appreciate it if you read through this and I would really, really appreciate answers. Thanks :) submitted by /u/just_conor12 [link] [comments]  ( 3 min )
    Poker AI Plays Itself
    submitted by /u/bluboxsw [link] [comments]
    AI News | AI Powered Robotic Boat Autonomously Cleans Harbors And Rivers | AI Cataract Detection | Machine Learning To Only Propose Molecules Which Are Synthesizable In Lab
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    Researchers From MIT and Cornell Develop STEGO (Self-Supervised Transformer With Energy-Based Graph Optimization): A Novel AI Framework That Distills Unsupervised Features Into High-Quality Discrete Semantic Labels
Unsupervised semantic segmentation seeks to uncover and localize semantically significant categories within image corpora without any annotation. However, creating annotated training data poses several challenges, which frequently outweigh the superior accuracy of supervised semantic segmentation methods. To extract meaningful categories from training data without any annotation, algorithms must develop features for every pixel that are both semantically relevant and compact enough to form discrete clusters. A team of researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), Google, and Cornell University has achieved this by creating a machine learning model named STEGO (Self-supervised Transformer with Energy-based Graph Optimization) that surpasses previous methods by decoupling feature learning from cluster compactification. STEGO is built on a frozen backbone, which serves as a source of learning feedback and as input to the segmentation head for predicting distilled features. This segmentation head is a direct feed-forward network with a ReLU activation function. Unlike earlier studies, the algorithm’s efficiency was increased without retraining or fine-tuning the backbone. The STEGO neural network retrieves global image information by pooling spatial features into a global average. Then, based on cosine similarity in the backbone’s feature space, a lookup table of each image’s K-nearest neighbours is computed. Continue Reading Paper: https://arxiv.org/pdf/2203.08414.pdf Github: https://github.com/mhamilton723/STEGO submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    What background knowledge is needed to understand traditional exploration methods?
I can understand most traditional and current methods of reinforcement learning, however when it comes to traditional exploration methods such as UCB or PUCB I am mostly lost. I can understand the algorithms intuitively when they are explained, but when I go over the papers and look at the proofs/explanations there seems to be some background knowledge I am not privy to. The book Reinforcement Learning: An Introduction doesn't seem to go over these in depth either. BTW I have a bachelor's and master's degree in Computer Science and have taken math classes such as statistics, calculus, LA, algorithms, etc. submitted by /u/Stochastic_Machine [link] [comments]  ( 1 min )
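The background most UCB proofs assume is concentration inequalities, Hoeffding's inequality in particular: the exploration bonus in UCB1 is exactly a Hoeffding-style confidence radius. Bandit-specific texts such as Lattimore and Szepesvári's Bandit Algorithms cover this material in depth. A minimal UCB1 sketch for a Bernoulli bandit (illustrative, with made-up arm probabilities) makes the bonus term concrete:

```python
# UCB1: each arm's index is its empirical mean plus an exploration bonus
# sqrt(2 ln t / n_i) derived from Hoeffding's inequality. Over time, pull
# counts concentrate on the best arm.
import math, random

def ucb1(arm_probs, horizon=10_000, seed=0):
    rng = random.Random(seed)
    k = len(arm_probs)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:                      # play each arm once to initialize
            arm = t - 1
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

print(ucb1([0.3, 0.5, 0.7]))  # most pulls go to the 0.7 arm
```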
    Help needed to understand Rllib attention model parameters
Hi! I recently started to learn about using attention (transformers). I use the rllib implementation (https://docs.ray.io/en/latest/rllib/rllib-models.html#default-model-config-settings), which follows a paper. So far I have some understanding of transformers but I still don't understand everything. When playing with parameters I saw a couple that I don't understand: how do they affect the learning and what exactly do they do (I looked at the code and the paper but still don't understand)? Can somebody explain what these are and how they affect the learning: 1) attention_memory_inference 2) attention_memory_training (also related to the previous point: when training, why are attention_memory_training and the input sequence concatenated? https://github.com/ray-project/ray/blob/master/rllib/models/torch/attention_net.py#L91 ) Many thanks :D submitted by /u/HBlackwooder [link] [comments]  ( 1 min )
Is this how simple parallel environment training works in a DDPG/TD3 model?
I tried to figure out how to do parallel environment training using TD3 or DDPG but only found bits and pieces of information, so this is what I came up with. This is what I have (a Unity "ant" environment): one ant in one environment. But I want this: parallel ants in multiple environments. As a result, I can learn faster because I have more data. So here's how I envision parallel training: 1) agents collect actions from one TD3/DDPG model; 2) observations from all agents are collected and put into the memory buffer; 3) repeat steps 1 and 2 until every agent finishes its episode (if an agent finishes its episode ahead of the others, it simply waits); 4) then comes the DDPG/TD3 training step; and repeat. Would this type of training even work? submitted by /u/Dougller [link] [comments]  ( 1 min )
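A minimal sketch of the loop described above follows; `envs`, `agent.act`, `agent.update`, and `buffer` are all hypothetical stand-ins. One detail worth noting: because TD3/DDPG are off-policy, a finished environment can simply be reset and continue collecting, so no agent actually needs to idle waiting for the others.

```python
# Sketch of parallel data collection for an off-policy agent (TD3/DDPG):
# N independent environment copies feed one shared replay buffer, and the
# single shared policy is updated as data arrives.
def collect_parallel(envs, agent, buffer, steps=1000):
    obs = [env.reset() for env in envs]
    for _ in range(steps):
        actions = [agent.act(o) for o in obs]        # one shared policy
        for i, env in enumerate(envs):
            next_o, reward, done, _ = env.step(actions[i])
            buffer.add(obs[i], actions[i], reward, next_o, done)
            obs[i] = env.reset() if done else next_o  # no waiting on others
        agent.update(buffer)                          # train as data arrives
```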
    Unity with MLAgents, Isaac Gym, OpenAI Gym and other environments to experiment with reinforcement learning
Hello, I am a master's student in computer science and I am specializing in artificial intelligence. I am approaching reinforcement learning for the first time in an intelligent robotics course, and would like to experiment with these techniques in a simulated environment. I'm looking for something that is not too complex to learn (I want to focus more on algorithm implementation and simulation rather than spending months figuring out the tool) and that gives enough freedom to modify and implement things. Unity seemed interesting because a possible application of these techniques that I would like to explore is the world of gaming and simulation, but I do not know if solid knowledge of Unity is required to set up a decent environment before you can get to the experiments. Moreover, I don't know to what level of detail it is possible to go with MLAgents and how much can be customized. Do you recommend one of the tools I proposed in the title, or something else? Also, can you recommend some material (books, videos, courses, tutorials) that strikes a good balance between explaining the tool and the theory? Thanks in advance! submitted by /u/Parruck [link] [comments]  ( 2 min )
Creating a custom environment for training an RL agent
Hi, I'm working on a project on 2D image reassembly using reinforcement learning. I want to create a custom (puzzle) environment for doing RL. Is there any way to create a custom environment that can be integrated with gym? If so, how do I create one? Thanks. submitted by /u/Praveen_Raja22 [link] [comments]  ( 1 min )
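Yes: gym supports custom environments via subclassing `gym.Env` and defining `reset`, `step`, and the observation/action spaces. A minimal sketch of a toy puzzle environment follows (the swap-action and tile-count reward are illustrative assumptions, and it uses the classic pre-0.26 gym API where `step` returns four values):

```python
import gym
import numpy as np
from gym import spaces

# Toy puzzle env: the state is a permutation of tile indices, an action
# swaps two tiles, and the reward is the gain in correctly placed tiles.
class PuzzleEnv(gym.Env):
    def __init__(self, n_tiles=9):
        super().__init__()
        self.n = n_tiles
        self.action_space = spaces.MultiDiscrete([n_tiles, n_tiles])  # swap (i, j)
        self.observation_space = spaces.Box(0, n_tiles - 1, (n_tiles,), np.int64)

    def reset(self):
        self.state = np.random.permutation(self.n)
        return self.state.copy()

    def step(self, action):
        i, j = action
        before = np.sum(self.state == np.arange(self.n))
        self.state[[i, j]] = self.state[[j, i]]               # swap tiles
        after = np.sum(self.state == np.arange(self.n))
        done = bool(after == self.n)                          # puzzle solved
        return self.state.copy(), float(after - before), done, {}
```

The class can be used directly with any gym-compatible RL library; using it through `gym.make` additionally requires registering it via `gym.envs.registration.register`.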
  • Open

    Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker
    Machine learning (ML) applications are complex to deploy and often require multiple ML models to serve a single inference request. A typical request may flow across multiple models with steps like preprocessing, data transformations, model selection logic, model aggregation, and postprocessing. This has led to the evolution of common design patterns such as serial inference […]  ( 15 min )
    Build a corporate credit ratings classifier using graph machine learning in Amazon SageMaker JumpStart
    Today, we’re releasing a new solution for financial graph machine learning (ML) in Amazon SageMaker JumpStart. JumpStart helps you quickly get started with ML and provides a set of solutions for the most common use cases that can be trained and deployed with just a few clicks. The new JumpStart solution (Graph-Based Credit Scoring) demonstrates […]  ( 8 min )
    Increase your content reach with automated document-to-speech conversion using Amazon AI services
    Reading the printed word opens up a world of information, imagination, and creativity. However, scanned books and documents may be difficult for people with vision impairment and learning disabilities to consume. In addition, some people prefer to listen to text-based content versus reading it. A document-to-speech solution extends the reach of digital content by giving […]  ( 8 min )
  • Open

    Do you usually build neural networks from scratch or do you use a library dedicated to it?
I'm curious, I've seen both, and while the ones using libraries sound harder to make, the ones made from scratch sound slightly sloppier. For context, I mean for predator-prey simulations and such, not image-to-text recognition or anything that hard. submitted by /u/HesAMagicalPoney55 [link] [comments]  ( 1 min )
    [Question] Interval Branch and Bound
Hello everyone, I'm studying the Branch and Bound algorithm, and I have a question (Reddit looks like the perfect place to ask other experienced people). I already have a working Branch and Bound algorithm, but I want to adapt it to work with intervals instead of single values. How do I do that? In Branch and Bound I have something like: for each sample, I calculate the costs and then run the BaB algorithm with those and my values: if (x - y < 0) then keep going with the tree, else stop the search. But when adapting to Interval Branch and Bound it would be something like: if ((x + ∆x) - (y + ∆y) < 0) keep going with the tree, else stop the search; and then I still need to check the other bound (x - ∆x). Will I need to calculate the costs twice (once for each bound) and also search the tree once for each cost? I may be making a really big mistake, but that's why I'm here. Can someone help clarify my problem? Thank you for your attention. submitted by /u/Zarathos_PT [link] [comments]  ( 1 min )
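For illustration, here is a minimal sketch (a toy, not a full solver) of the interval test above. Interval subtraction yields both bounds of x - y in one operation, so the costs need not be computed in two separate passes and the tree is searched only once: at each node you ask whether the condition can still hold somewhere in the box (for sound pruning, keep every such branch) or must hold everywhere (for certification).

```python
# Intervals are (lo, hi) tuples; one subtraction gives both bounds at once.
def sub(a, b):
    """Interval subtraction: [a] - [b] = (a_lo - b_hi, a_hi - b_lo)."""
    return (a[0] - b[1], a[1] - b[0])

def can_be_negative(iv):   # condition possibly holds -> keep branching
    return iv[0] < 0

def must_be_negative(iv):  # condition certainly holds on the whole box
    return iv[1] < 0

x = (3.0 - 0.5, 3.0 + 0.5)   # x with uncertainty dx = 0.5
y = (3.2 - 0.4, 3.2 + 0.4)   # y with uncertainty dy = 0.4
d = sub(x, y)                # (-1.1, 0.7): sign of x - y is undecided
print(can_be_negative(d), must_be_negative(d))  # True False -> keep branch
```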
    Future Computers Will Be Radically Different
    submitted by /u/keghn [link] [comments]
  • Open

    Why target ads at pregnant women
    I’m listening to a podcast interviewing Neil Richards, the author of Why Privacy Matters. Richards makes a couple interesting points about the infamous example of Target figuring out which women were pregnant based on their purchase history. First, pregnancy is a point at which women are open to trying new things. So if a company […] Why target ads at pregnant women first appeared on John D. Cook.  ( 1 min )
    Curiously simple approximations
As I’ve written about here and elsewhere, the following simple approximations are fairly accurate: log10 x ≈ (x - 1)/(x + 1), loge x ≈ 2(x - 1)/(x + 1), log2 x ≈ 3(x - 1)/(x + 1). It’s a little surprising that each is as accurate as it is, but it’s also surprising that the approximations for […] Curiously simple approximations first appeared on John D. Cook.  ( 1 min )
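A quick numerical check of the three approximations (a short script with a handful of arbitrary test points); all three are scalings of the same rational function (x - 1)/(x + 1), which matches the logarithm exactly at x = 1:

```python
import math

# Compare each logarithm against its quoted rational approximation.
for x in (0.5, 1.0, 2.0, 3.0):
    r = (x - 1) / (x + 1)
    print(f"x={x}: log10 {math.log10(x):+.4f} ~ {r:+.4f}   "
          f"ln {math.log(x):+.4f} ~ {2*r:+.4f}   "
          f"log2 {math.log2(x):+.4f} ~ {3*r:+.4f}")
```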
  • Open

    Mown Away: Startup Rolls Out Autonomous Lawnmower With Cutting-Edge Tech
Jack Morrison and Isaac Roberts, co-founders of Replica Labs, were restless two years after their 3D vision startup was acquired, seeking another adventure. Then, in 2018, when Morrison was mowing his lawn, it struck him: autonomous lawn mowers. The two, along with Davis Foster, co-founded Scythe Robotics. The company, based in Boulder, Colo., has a […] The post Mown Away: Startup Rolls Out Autonomous Lawnmower With Cutting-Edge Tech appeared first on NVIDIA Blog.  ( 3 min )
    Meet the Omnivore: 3D Artist Creates Towering Work With NVIDIA Omniverse
    Edward McEvenue grew up making claymations in LEGO towns. Now, he’s creating photorealistic animations in virtual cities, drawing on more than a decade of experience in the motion graphics industry. The post Meet the Omnivore: 3D Artist Creates Towering Work With NVIDIA Omniverse appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Business Model Transformation: Keys to Monetizing the Edge
I was conducting a Tech Talk at a client when I mentioned the astronomical growth of data at “the edge”; that data creation at the edge is growing almost as fast as that in the cloud according to IDC. This sent a noticeable ripple across the executive team audience.  The executives immediately began debating “How… The post Business Model Transformation: Keys to Monetizing the Edge appeared first on Data Science Central.  ( 5 min )
    Are PDF Documents a Thing of the Past?
There have been many articles predicting the death of the PDF format, invented in 1993. Some of these articles are 10 years old: you can find them by googling “death of PDF”. With the advent of fluid or liquid layout design on almost every website, widespread Internet browsing on small devices (cell phones), and notebooks combining… The post Are PDF Documents a Thing of the Past? appeared first on Data Science Central.  ( 4 min )
  • Open

    10 Ways Technology is Changing Healthcare: How Innovation is Impacting the Medical Industry
    Source: Managed Healthcare Executive  ( 6 min )
  • Open

    Using Kaggle in Machine Learning Projects
    You’ve probably heard of Kaggle data science competitions, but did you know that Kaggle has many other features that can help you with your next machine learning project? For people looking for datasets for their next machine learning project, Kaggle allows you to access public datasets by others and share your own datasets. For those […] The post Using Kaggle in Machine Learning Projects appeared first on Machine Learning Mastery.  ( 7 min )
  • Open

    On the Optimization of Margin Distribution. (arXiv:2204.14118v1 [cs.LG])
Margin has played an important role in the design and analysis of learning algorithms over the past years, mostly through the maximization of the minimum margin. Recent years have witnessed increasing empirical study of the optimization of margin distribution according to different statistics such as the median margin, average margin, margin variance, etc., whereas there is a relative paucity of theoretical understanding. In this work, we take one step in this direction by providing a new generalization error bound, which is closely tied to the margin distribution by incorporating ingredients such as the average margin and the semi-variance, a new margin statistic for the characterization of margin distribution. Inspired by the theoretical findings, we propose MSVMAv, an efficient approach to achieve better performance by optimizing the margin distribution in terms of its empirical average margin and semi-variance. We finally conduct extensive experiments to show the superiority of the proposed MSVMAv approach.  ( 2 min )
    Intuitive Shape Editing in Latent Space. (arXiv:2111.12488v2 [cs.CV] UPDATED)
    The use of autoencoders for shape editing or generation through latent space manipulation suffers from unpredictable changes in the output shape. Our autoencoder-based method enables intuitive shape editing in latent space by disentangling latent sub-spaces into style variables and control points on the surface that can be manipulated independently. The key idea is adding a Lipschitz-type constraint to the loss function, i.e. bounding the change of the output shape proportionally to the change in latent space, leading to interpretable latent space representations. The control points on the surface that are part of the latent code of an object can then be freely moved, allowing for intuitive shape editing directly in latent space. We evaluate our method by comparing to state-of-the-art data-driven shape editing methods. We further demonstrate the expressiveness of our learned latent space by leveraging it for unsupervised part segmentation.  ( 2 min )
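The key idea lends itself to a short sketch. Below is a hedged illustration (not the authors' implementation) of what a Lipschitz-type penalty on a decoder could look like in PyTorch: the change in decoded shape is penalized whenever it exceeds a constant times the change in latent code. The `decoder`, the perturbation scale, and the constant are all assumptions.

```python
import torch

# Hedged sketch of a Lipschitz-type penalty: bound the change of the
# decoded output proportionally to the change in latent space, as the
# abstract describes. `decoder` maps latent codes to flattened shapes.
def lipschitz_penalty(decoder, z, lipschitz_const=1.0):
    z2 = z + 0.01 * torch.randn_like(z)          # small latent perturbation
    out, out2 = decoder(z), decoder(z2)
    d_out = (out - out2).flatten(1).norm(dim=1)  # change of output shape
    d_z = (z - z2).norm(dim=1)                   # change in latent space
    # penalize only violations of d_out <= L * d_z
    return torch.relu(d_out - lipschitz_const * d_z).mean()
```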
    Network Topology Optimization via Deep Reinforcement Learning. (arXiv:2204.14133v1 [cs.NI])
Topology impacts important network performance metrics, including link utilization, throughput and latency, and is of central importance to network operators. However, due to the combinatorial nature of network topology, it is extremely difficult to obtain an optimal solution, especially since topology planning in networks also often comes with management-specific constraints. As a result, local optimization with hand-tuned heuristic methods from human experts is often adopted in practice. Yet, heuristic methods cannot cover the global topology design space while taking constraints into account, and cannot guarantee finding good solutions. In this paper, we propose a novel deep reinforcement learning (DRL) algorithm, called Advantage Actor Critic-Graph Searching (A2C-GS), for network topology optimization. A2C-GS consists of three novel components: a verifier to validate the correctness of a generated network topology, a graph neural network (GNN) to efficiently approximate topology ratings, and a DRL actor layer to conduct the topology search. A2C-GS can efficiently search over a large topology space and output topologies with satisfactory performance. We conduct a case study based on a real network scenario, and our experimental results demonstrate the superior performance of A2C-GS in terms of both efficiency and performance.  ( 2 min )
    Goldilocks-curriculum Domain Randomization and Fractal Perlin Noise with Application to Sim2Real Pneumonia Lesion Detection. (arXiv:2204.13849v1 [cs.CV])
A computer-aided detection (CAD) system based on machine learning is expected to assist radiologists in making a diagnosis. It is desirable to build CAD systems for the various types of diseases accumulating daily in a hospital. An obstacle in developing a CAD system for a disease is that the number of medical images is typically too small to improve the performance of the machine learning model. In this paper, we aim to explore ways to address this problem through a sim2real transfer approach in the medical imaging field. To build a platform to evaluate the performance of sim2real transfer methods in medical imaging, we construct a benchmark dataset that consists of $101$ chest X-ray images with difficult-to-identify pneumonia lesions, as judged by an experienced radiologist, and a simulator based on fractal Perlin noise and the X-ray principle for generating pseudo pneumonia lesions. We then develop a novel domain randomization method, called Goldilocks-curriculum domain randomization (GDR), and evaluate it on this platform.  ( 2 min )
    A review of Federated Learning in Intrusion Detection Systems for IoT. (arXiv:2204.12443v2 [cs.CR] UPDATED)
    Intrusion detection systems are evolving into intelligent systems that perform data analysis searching for anomalies in their environment. The development of deep learning technologies opened the door to build more complex and effective threat detection models. However, training those models may be computationally infeasible in most Internet of Things devices. Current approaches rely on powerful centralized servers that receive data from all their parties -- violating basic privacy constraints and substantially affecting response times and operational costs due to the huge communication overheads. To mitigate these issues, Federated Learning emerged as a promising approach where different agents collaboratively train a shared model, neither exposing training data to others nor requiring a compute-intensive centralized infrastructure. This paper focuses on the application of Federated Learning approaches in the field of Intrusion Detection. Both technologies are described in detail and current scientific progress is reviewed and categorized. Finally, the paper highlights the limitations present in recent works and presents some future directions for this technology.
    Forecasting large-scale circulation regimes using deformable convolutional neural networks and global spatiotemporal climate data. (arXiv:2202.04964v2 [cs.LG] UPDATED)
    Classifying the state of the atmosphere into a finite number of large-scale circulation regimes is a popular way of investigating teleconnections, the predictability of severe weather events, and climate change. Here, we investigate a supervised machine learning approach based on deformable convolutional neural networks (deCNNs) and transfer learning to forecast the North Atlantic-European weather regimes during extended boreal winter for 1 to 15 days into the future. We apply state-of-the-art interpretation techniques from the machine learning literature to attribute particular regions of interest or potential teleconnections relevant for any given weather cluster prediction or regime transition. We demonstrate superior forecasting performance relative to several classical meteorological benchmarks, as well as logistic regression and random forests. Due to its wider field of view, we also observe deCNN achieving considerably better performance than regular convolutional neural networks at lead times beyond 5-6 days. Finally, we find transfer learning to be of paramount importance, similar to previous data-driven atmospheric forecasting studies.
    SleepPPG-Net: a deep learning algorithm for robust sleep staging from continuous photoplethysmography. (arXiv:2202.05735v4 [cs.LG] UPDATED)
    Introduction: Sleep staging is an essential component in the diagnosis of sleep disorders and management of sleep health. It is traditionally measured in a clinical setting and requires a labor-intensive labeling process. We hypothesize that it is possible to perform robust 4-class sleep staging using the raw photoplethysmography (PPG) time series and modern advances in deep learning (DL). Methods: We used two publicly available sleep databases that included raw PPG recordings, totalling 2,374 patients and 23,055 hours. We developed SleepPPG-Net, a DL model for 4-class sleep staging from the raw PPG time series. SleepPPG-Net was trained end-to-end and consists of a residual convolutional network for automatic feature extraction and a temporal convolutional network to capture long-range contextual information. We benchmarked the performance of SleepPPG-Net against models based on the best-reported state-of-the-art (SOTA) algorithms. Results: When benchmarked on a held-out test set, SleepPPG-Net obtained a median Cohen's Kappa ($\kappa$) score of 0.75 against 0.69 for the best SOTA approach. SleepPPG-Net showed good generalization performance to an external database, obtaining a $\kappa$ score of 0.74 after transfer learning. Perspective: Overall, SleepPPG-Net provides new SOTA performance. In addition, performance is high enough to open the path to the development of wearables that meet the requirements for usage in clinical applications such as the diagnosis and monitoring of obstructive sleep apnea.
    Exploration and Exploitation in Federated Learning to Exclude Clients with Poisoned Data. (arXiv:2204.14020v1 [cs.DC])
Federated Learning (FL) is one of the hot research topics; it utilizes Machine Learning (ML) in a distributed manner without directly accessing private data on clients. However, FL faces many challenges, including the difficulty of obtaining high accuracy, the high communication cost between clients and the server, and security attacks related to adversarial ML. To tackle these three challenges, we propose an FL algorithm inspired by evolutionary techniques. The proposed algorithm randomly groups clients into many clusters, each with a model selected at random to explore the performance of different models. The clusters are then trained in an iterative process in which the worst-performing cluster is removed in each iteration until one cluster remains. In each iteration, some clients are expelled from clusters, either for using poisoned data or for low performance. The surviving clients are exploited in the next iteration. The remaining cluster with surviving clients is then used for training the best FL model (i.e., the remaining FL model). Communication cost is reduced since fewer clients are used in the final training of the FL model. To evaluate the performance of the proposed algorithm, we conduct a number of experiments using the FEMNIST dataset and compare the results against the random FL algorithm. The experimental results show that the proposed algorithm outperforms the baseline algorithm in terms of accuracy, communication cost, and security.
    The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs. (arXiv:2110.07409v3 [math.OC] UPDATED)
    We consider the problem of finding the best memoryless stochastic policy for an infinite-horizon partially observable Markov decision process (POMDP) with finite state and action spaces with respect to either the discounted or mean reward criterion. We show that the (discounted) state-action frequencies and the expected cumulative reward are rational functions of the policy, whereby the degree is determined by the degree of partial observability. We then describe the optimization problem as a linear optimization problem in the space of feasible state-action frequencies subject to polynomial constraints that we characterize explicitly. This allows us to address the combinatorial and geometric complexity of the optimization problem using recent tools from polynomial optimization. In particular, we estimate the number of critical points and use the polynomial programming description of reward maximization to solve a navigation problem in a grid world.
    Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization. (arXiv:2203.13167v3 [cs.CV] UPDATED)
In this paper, we investigate the continual learning of Vision Transformers (ViT) for the challenging exemplar-free scenario, with special focus on how to efficiently distill the knowledge of its crucial self-attention mechanism (SAM). Our work takes an initial step towards a surgical investigation of SAM for designing coherent continual learning methods in ViTs. We first carry out an evaluation of established continual learning regularization techniques. We then examine the effect of regularization when applied to two key enablers of SAM: (a) the contextualized embedding layers, for their ability to capture well-scaled representations with respect to the values, and (b) the prescaled attention maps, for carrying value-independent global contextual information. We depict the perks of each distilling strategy on two image recognition benchmarks (CIFAR100 and ImageNet-32) -- while (a) leads to a better overall accuracy, (b) helps enhance the rigidity by maintaining competitive performances. Furthermore, we identify the limitation imposed by the symmetric nature of regularization losses. To alleviate this, we propose an asymmetric variant and apply it to the pooled output distillation (POD) loss adapted for ViTs. Our experiments confirm that introducing asymmetry to POD boosts its plasticity while retaining stability across (a) and (b). Moreover, we observe low forgetting measures for all the compared methods, indicating that ViTs might be naturally inclined continual learners.
    Fiber Bundle Morphisms as a Framework for Modeling Many-to-Many Maps. (arXiv:2203.08189v2 [cs.LG] UPDATED)
    While it is not generally reflected in the `nice' datasets used for benchmarking machine learning algorithms, the real-world is full of processes that would be best described as many-to-many. That is, a single input can potentially yield many different outputs (whether due to noise, imperfect measurement, or intrinsic stochasticity in the process) and many different inputs can yield the same output (that is, the map is not injective). For example, imagine a sentiment analysis task where, due to linguistic ambiguity, a single statement can have a range of different sentiment interpretations while at the same time many distinct statements can represent the same sentiment. When modeling such a multivalued function $f: X \rightarrow Y$, it is frequently useful to be able to model the distribution on $f(x)$ for specific input $x$ as well as the distribution on fiber $f^{-1}(y)$ for specific output $y$. Such an analysis helps the user (i) better understand the variance intrinsic to the process they are studying and (ii) understand the range of specific input $x$ that can be used to achieve output $y$. Following existing work which used a fiber bundle framework to better model many-to-one processes, we describe how morphisms of fiber bundles provide a template for building models which naturally capture the structure of many-to-many processes.  ( 2 min )
    Reducing Neural Architecture Search Spaces with Training-Free Statistics and Computational Graph Clustering. (arXiv:2204.14103v1 [cs.LG])
    The computational demands of neural architecture search (NAS) algorithms are usually directly proportional to the size of their target search spaces. Thus, limiting the search to high-quality subsets can greatly reduce the computational load of NAS algorithms. In this paper, we present Clustering-Based REDuction (C-BRED), a new technique to reduce the size of NAS search spaces. C-BRED reduces a NAS space by clustering the computational graphs associated with its architectures and selecting the most promising cluster using proxy statistics correlated with network accuracy. When considering the NAS-Bench-201 (NB201) data set and the CIFAR-100 task, C-BRED selects a subset with 70% average accuracy instead of the whole space's 64% average accuracy.  ( 2 min )
    Abstraction for Deep Reinforcement Learning. (arXiv:2202.05839v3 [cs.LG] UPDATED)
    We characterise the problem of abstraction in the context of deep reinforcement learning. Various well established approaches to analogical reasoning and associative memory might be brought to bear on this issue, but they present difficulties because of the need for end-to-end differentiability. We review developments in AI and machine learning that could facilitate their adoption.
    Unsolved Problems in ML Safety. (arXiv:2109.13916v4 [cs.LG] UPDATED)
    Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
    KnowAugNet: Multi-Source Medical Knowledge Augmented Medication Prediction Network with Multi-Level Graph Contrastive Learning. (arXiv:2204.11736v2 [cs.AI] UPDATED)
Predicting medications is a crucial task in many intelligent healthcare systems. It can assist doctors in making informed medication decisions for patients according to electronic medical records (EMRs). However, medication prediction is a challenging data mining task due to the complex relations between medical codes. Most existing studies focus on utilizing the inherent relations between homogeneous codes of a medical ontology graph to enhance their representations using supervised methods, and few studies pay attention to the valuable relations between heterogeneous or homogeneous medical codes from historical EMRs, which further limits the prediction performance and application scenarios. Therefore, to address these limitations, this paper proposes KnowAugNet, a multi-source medical knowledge augmented medication prediction network that can fully capture the diverse relations between medical codes via a multi-level graph contrastive learning framework. Specifically, KnowAugNet first leverages graph contrastive learning, using a graph attention network as the encoder, to capture the implicit relations between homogeneous medical codes in the medical ontology graph, obtaining knowledge-augmented medical code embedding vectors. It then uses graph contrastive learning with a weighted graph convolutional network as the encoder to capture the correlative relations between homogeneous or heterogeneous medical codes in the constructed medical prior relation graph, obtaining relation-augmented medical code embedding vectors. Finally, the augmented medical code embedding vectors and the supervised medical code embedding vectors are retrieved and input to a sequential learning network to capture the temporal relations of medical codes and predict medications for patients.  ( 2 min )
    Rockafellian Relaxation in Optimization under Uncertainty: Asymptotically Exact Formulations. (arXiv:2204.04762v2 [math.OC] UPDATED)
    In practice, optimization models are often prone to unavoidable inaccuracies due to lack of data and dubious assumptions. Traditionally, this placed special emphasis on risk-based and robust formulations, and their focus on "conservative" decisions. We develop, in contrast, an "optimistic" framework based on Rockafellian relaxations in which optimization is conducted not only over the original decision space but also jointly with a choice of model perturbation. The framework enables us to address challenging problems with ambiguous probability distributions from the areas of two-stage stochastic optimization without relatively complete recourse, probability functions lacking continuity properties, expectation constraints, and outlier analysis. We are also able to circumvent the fundamental difficulty in stochastic optimization that convergence of distributions fails to guarantee convergence of expectations. The framework centers on the novel concepts of exact and asymptotically exact Rockafellians, with interpretations of "negative" regularization emerging in certain settings. We illustrate the role of Phi-divergence, examine rates of convergence under changing distributions, and explore extensions to first-order optimality conditions. The main development is free of assumptions about convexity, smoothness, and even continuity of objective functions.
    RAMP-CNN: A Novel Neural Network for Enhanced Automotive Radar Object Recognition. (arXiv:2011.08981v2 [eess.SP] UPDATED)
    Millimeter-wave radars are being increasingly integrated into commercial vehicles to support new advanced driver-assistance systems by enabling robust and high-performance object detection, localization, as well as recognition - a key component of new environmental perception. In this paper, we propose a novel radar multiple-perspectives convolutional neural network (RAMP-CNN) that extracts the location and class of objects based on further processing of the range-velocity-angle (RVA) heatmap sequences. To bypass the complexity of 4D convolutional neural networks (NN), we propose to combine several lower-dimension NN models within our RAMP-CNN model that nonetheless approaches the performance upper-bound with lower complexity. The extensive experiments show that the proposed RAMP-CNN model achieves better average recall and average precision than prior works in all testing scenarios. Besides, the RAMP-CNN model is validated to work robustly under nighttime, which enables low-cost radars as a potential substitute for pure optical sensing under severe conditions.  ( 2 min )
    Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation. (arXiv:2202.02440v2 [cs.CV] UPDATED)
    In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
    Human's Role in-the-Loop. (arXiv:2204.14192v1 [cs.DB])
Data integration has recently been challenged by the need to handle large volumes of data, arriving at high velocity from a variety of sources, which demonstrate varying levels of veracity. This challenging setting, often referred to as big data, renders many of the existing techniques, especially those that are human-intensive, obsolete. Big data also produces technological advancements such as the Internet of things, cloud computing, and deep learning, and accordingly provides a new, exciting, and challenging research agenda. Given the availability of data and the improvement of machine learning techniques, this blog discusses the respective roles of humans and machines in achieving cognitive tasks in matching, aiming to determine whether the traditional roles of humans and machines are subject to change. Such an investigation, we believe, will pave the way to better utilizing both human and machine resources in new and innovative manners. We shall discuss two possible modes of change, namely humans out and humans in. Humans out aims at exploring out-of-the-box latent matching reasoning using machine learning algorithms when attempting to surpass human matcher performance. Pursuing out-of-the-box thinking, machine and deep learning can be involved in matching. Humans in explores how to better involve humans in the matching loop by assigning human matchers a role symmetric to that of the algorithmic matcher in the matching process.  ( 2 min )
    Learning Anisotropic Interaction Rules from Individual Trajectories in a Heterogeneous Cellular Population. (arXiv:2204.14141v1 [q-bio.QM])
    Interacting particle system (IPS) models have proven to be highly successful for describing the spatial movement of organisms. However, it has proven challenging to infer the interaction rules directly from data. In the field of equation discovery, the Weak form Sparse Identification of Nonlinear Dynamics (WSINDy) methodology has been shown to be very computationally efficient for identifying the governing equations of complex systems, even in the presence of substantial noise. Motivated by the success of IPS models to describe the spatial movement of organisms, we develop WSINDy for second order IPSs to model the movement of communities of cells. Specifically, our approach learns the directional interaction rules that govern the dynamics of a heterogeneous population of migrating cells. Rather than aggregating cellular trajectory data into a single best-fit model, we learn the models for each individual cell. These models can then be efficiently classified according to the active classes of interactions present in the model. From these classifications, aggregated models are constructed hierarchically to simultaneously identify different species of cells present in the population and determine best-fit models for each species. We demonstrate the efficiency and proficiency of the method on several test scenarios, motivated by common cell migration experiments.
    Analysing the Influence of Attack Configurations on the Reconstruction of Medical Images in Federated Learning. (arXiv:2204.13808v1 [eess.IV])
    The idea of federated learning is to train deep neural network models collaboratively and share them with multiple participants without exposing their private training data to each other. This is highly attractive in the medical domain due to patients' privacy records. However, a recently proposed method called Deep Leakage from Gradients enables attackers to reconstruct data from shared gradients. This study shows how easy it is to reconstruct images for different data initialization schemes and distance measures. We show how data and model architecture influence the optimal choice of initialization scheme and distance measure configurations when working with single images. We demonstrate that the choice of initialization scheme and distance measure can significantly increase convergence speed and quality. Furthermore, we find that the optimal attack configuration depends largely on the nature of the target image distribution and the complexity of the model architecture.
    Pre-training helps Bayesian optimization too. (arXiv:2109.08215v3 [cs.LG] UPDATED)
    Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. Theoretically, we show a bounded regret of BO with pre-trained priors. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
    Data+Shift: Supporting visual investigation of data distribution shifts by data scientists. (arXiv:2204.14025v1 [cs.LG])
    Machine learning on data streams is increasingly more present in multiple domains. However, there is often data distribution shift that can lead machine learning models to make incorrect decisions. While there are automatic methods to detect when drift is happening, human analysis, often by data scientists, is essential to diagnose the causes of the problem and adjust the system. We propose Data+Shift, a visual analytics tool to support data scientists in the task of investigating the underlying factors of shift in data features in the context of fraud detection. Design requirements were derived from interviews with data scientists. Data+Shift is integrated with JupyterLab and can be used alongside other data science tools. We validated our approach with a think-aloud experiment where a data scientist used the tool for a fraud detection use case.
    H2H: Heterogeneous Model to Heterogeneous System Mapping with Computation and Communication Awareness. (arXiv:2204.13852v1 [cs.LG])
    The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perceiving and multi-task learning, i.e., multi-modality multi-task (MMMT), resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as it becomes the prevailing method to integrate multiple dedicated accelerators into one system. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computations, in this work, we argue that it is indispensable to consider computation and communication simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, the system overall latency and energy consumption can be largely reduced. The superior performance of our work is evaluated based on MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms.  ( 2 min )
    Controlled Generation of Unseen Faults for Partial and OpenSet&Partial Domain Adaptation. (arXiv:2204.14068v1 [cs.LG])
New operating conditions can result in a performance drop of fault diagnostics models due to the domain gap between the training and the testing data distributions. While several domain adaptation approaches have been proposed to overcome such domain shifts, their application is limited if the label spaces of the two domains are not congruent. To improve the transferability of the trained models, particularly in setups where only the healthy data class is shared between the two domains, we propose a new framework based on a Wasserstein GAN for Partial and OpenSet&Partial domain adaptation. The main contribution is the controlled fault data generation, which enables generating unobserved fault types and severity levels in the target domain while having access only to the healthy samples in the target domain and faulty samples in the source domain. To evaluate the ability of the proposed method to bridge domain gaps in different domain adaptation settings, we conduct Partial as well as OpenSet&Partial domain adaptation experiments on two bearing fault diagnostics case studies. The results show the versatility of the framework and that the synthetically generated fault data helps bridge the domain gaps, especially in instances where the domain gap is large.
    Towards Generalizable Semantic Product Search by Text Similarity Pre-training on Search Click Logs. (arXiv:2204.05231v2 [cs.IR] UPDATED)
Recently, semantic search has been successfully applied to e-commerce product search and the learned semantic space(s) for query and product encoding are expected to generalize to unseen queries or products. Yet, whether generalization can conveniently emerge has not been thoroughly studied in the domain thus far. In this paper, we examine several general-domain and domain-specific pre-trained RoBERTa variants and discover that general-domain fine-tuning does not help generalization, which aligns with the discovery of prior art. Proper domain-specific fine-tuning with clickstream data can lead to better model generalization, based on a bucketed analysis of a publicly available manually annotated query-product pair dataset.  ( 2 min )
    Recommendations on test datasets for evaluating AI solutions in pathology. (arXiv:2204.14226v1 [eess.IV])
    Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations for the collection of test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help regulatory agencies and end users verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.  ( 2 min )
    Semi-Discrete Optimal Transport: Hardness, Regularization and Numerical Solution. (arXiv:2103.06263v2 [cs.LG] UPDATED)
    Semi-discrete optimal transport problems, which evaluate the Wasserstein distance between a discrete and a generic (possibly non-discrete) probability measure, are believed to be computationally hard. Even though such problems are ubiquitous in statistics, machine learning and computer vision, however, this perception has not yet received a theoretical justification. To fill this gap, we prove that computing the Wasserstein distance between a discrete probability measure supported on two points and the Lebesgue measure on the standard hypercube is already #P-hard. This insight prompts us to seek approximate solutions for semi-discrete optimal transport problems. We thus perturb the underlying transportation cost with an additive disturbance governed by an ambiguous probability distribution, and we introduce a distributionally robust dual optimal transport problem whose objective function is smoothed with the most adverse disturbance distributions from within a given ambiguity set. We further show that smoothing the dual objective function is equivalent to regularizing the primal objective function, and we identify several ambiguity sets that give rise to several known and new regularization schemes. As a byproduct, we discover an intimate relation between semi-discrete optimal transport problems and discrete choice models traditionally studied in psychology and economics. To solve the regularized optimal transport problems efficiently, we use a stochastic gradient descent algorithm with imprecise stochastic gradient oracles. A new convergence analysis reveals that this algorithm improves the best known convergence guarantee for semi-discrete optimal transport problems with entropic regularizers.
    Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection. (arXiv:2111.07819v3 [cs.CL] UPDATED)
    A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers proposed many methods for flagging online misinformation related to COVID-19. However, these methods predominantly target specific content types (e.g., news) or platforms (e.g., Twitter). The methods' capabilities to generalize were largely unclear so far. We evaluate fifteen Transformer-based models on five COVID-19 misinformation datasets that include social media posts, news articles, and scientific papers to fill this gap. We show tokenizers and models tailored to COVID-19 data do not provide a significant advantage over general-purpose ones. Our study provides a realistic assessment of models for detecting COVID-19 misinformation. We expect that evaluating a broad spectrum of datasets and models will benefit future research in developing misinformation detection systems.
    To Trust or Not To Trust Prediction Scores for Membership Inference Attacks. (arXiv:2111.09076v2 [cs.LG] UPDATED)
    Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures. Consequently, MIAs will miserably fail since overconfidence leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated, and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of models and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions, the more they reveal the training data.
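To make the attack being critiqued concrete, here is a minimal sketch (an illustration, not the paper's code) of a score-threshold membership inference attack: flag a sample as a training member when the model's top softmax score exceeds a threshold. The abstract's argument is precisely that overconfident networks push non-members above any such threshold too, inflating the false-positive rate.

```python
import numpy as np

# Score-based MIA sketch: `member_probs` and `nonmember_probs` are the
# model's softmax outputs on known training samples and held-out samples,
# each of shape (n_samples, n_classes). The attack flags high-confidence
# predictions as "member".
def score_mia(member_probs, nonmember_probs, threshold=0.95):
    conf_m = member_probs.max(axis=1)
    conf_n = nonmember_probs.max(axis=1)
    tpr = (conf_m >= threshold).mean()   # members correctly flagged
    fpr = (conf_n >= threshold).mean()   # non-members wrongly flagged
    return tpr, fpr                      # overconfident models inflate fpr
```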
    Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain. (arXiv:2204.13883v1 [eess.AS])
The selection of maskers and playback gain levels in a soundscape augmentation system is crucial to its effectiveness in improving the overall acoustic comfort of a given environment. Traditionally, the selection of appropriate maskers and gain levels has been informed by expert opinion, which may not be representative of the target population, or by listening tests, which can be time-consuming and labour-intensive. Furthermore, the resulting static choices of masker and gain are often inflexible to the dynamic nature of real-world soundscapes. In this work, we utilized a deep learning model to perform joint selection of the optimal masker and its gain level for a given soundscape. The proposed model was designed with highly modular building blocks, allowing for an optimized inference process that can quickly search through a large number of masker and gain combinations. In addition, we introduced the use of feature-domain soundscape augmentation conditioned on the digital gain level, eliminating the computationally expensive waveform-domain mixing process during inference time, as well as the tedious pre-calibration process required for new maskers. The proposed system was validated on a large-scale dataset of subjective responses to augmented soundscapes with more than 440 participants, ensuring the ability of the model to predict the combined effect of the masker and its gain level on the perceptual pleasantness level.  ( 2 min )
    Wide and Deep Neural Networks Achieve Optimality for Classification. (arXiv:2204.14126v1 [cs.LG])
    While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are optimal for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and Neural Tangent Kernels, we provide explicit activation functions that can be used to construct networks that achieve optimality. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: (1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); (2) majority vote (model predictions are given by the label of the class with greatest representation in the training set); or (3) singular kernel classifiers (a set of classifiers containing those that achieve optimality). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.
    A Framework for Constructing Machine Learning Models with Feature Set Optimisation for Evapotranspiration Partitioning. (arXiv:2204.14142v1 [cs.LG])
    A deeper understanding of the drivers of evapotranspiration and the modelling of its constituent parts (evaporation and transpiration) could be of significant importance to the monitoring and management of water resources globally over the coming decades. In this work, we developed a framework to identify the best performing machine learning algorithm from a candidate set, select optimal predictive features as well as ranking features in terms of their importance to predictive accuracy. Our experiments used 3 separate feature sets across 4 wetland sites as input into 8 candidate machine learning algorithms, providing 96 sets of experimental configurations. Given this high number of parameters, our results show strong evidence that there is no singularly optimal machine learning algorithm or feature set across all of the wetland sites studied despite their similarities. A key finding discovered when examining feature importance is that methane flux, a feature whose relationship with evapotranspiration is not generally examined, may contribute to further biophysical process understanding.
    Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data. (arXiv:2202.11783v3 [cs.LG] UPDATED)
    Natural science datasets frequently violate assumptions of independence. Samples may be clustered (e.g. by study site, subject, or experimental batch), leading to spurious associations, poor model fitting, and confounded analyses. While largely unaddressed in deep learning, this problem has been handled in the statistics community through mixed effects models, which separate cluster-invariant fixed effects from cluster-specific random effects. We propose a general-purpose framework for Adversarially-Regularized Mixed Effects Deep learning (ARMED) models through non-intrusive additions to existing neural networks: 1) an adversarial classifier constraining the original model to learn only cluster-invariant features, 2) a random effects subnetwork capturing cluster-specific features, and 3) an approach to apply random effects to clusters unseen during training. We apply ARMED to dense, convolutional, and autoencoder neural networks on 4 applications including simulated nonlinear data, dementia prognosis and diagnosis, and live-cell image analysis. Compared to prior techniques, ARMED models better distinguish confounded from true associations in simulations and learn more biologically plausible features in clinical applications. They can also quantify inter-cluster variance and visualize cluster effects in data. Finally, ARMED improves accuracy on data from clusters seen during training (up to 28% vs. conventional models) and generalization to unseen clusters (up to 9% vs. conventional models).
    COVID-Net US-X: Enhanced Deep Neural Network for Detection of COVID-19 Patient Cases from Convex Ultrasound Imaging Through Extended Linear-Convex Ultrasound Augmentation Learning. (arXiv:2204.13851v1 [eess.IV])
    As the global population continues to face significant negative impact by the on-going COVID-19 pandemic, there has been an increasing usage of point-of-care ultrasound (POCUS) imaging as a low-cost and effective imaging modality of choice in the COVID-19 clinical workflow. A major barrier with widespread adoption of POCUS in the COVID-19 clinical workflow is the scarcity of expert clinicians that can interpret POCUS examinations, leading to considerable interest in deep learning-driven clinical decision support systems to tackle this challenge. A major challenge to building deep neural networks for COVID-19 screening using POCUS is the heterogeneity in the types of probes used to capture ultrasound images (e.g., convex vs. linear probes), which can lead to very different visual appearances. In this study, we explore the impact of leveraging extended linear-convex ultrasound augmentation learning on producing enhanced deep neural networks for COVID-19 assessment, where we conduct data augmentation on convex probe data alongside linear probe data that have been transformed to better resemble convex probe data. Experimental results using an efficient deep columnar anti-aliased convolutional neural network designed via a machine-driven design exploration strategy (which we name COVID-Net US-X) show that the proposed extended linear-convex ultrasound augmentation learning significantly increases performance, with a gain of 5.1% in test accuracy and 13.6% in AUC.
    PokeBNN: A Binary Pursuit of Lightweight Accuracy. (arXiv:2112.00133v2 [cs.LG] UPDATED)
    Optimization of Top-1 ImageNet promotes enormous networks that may be impractical in inference settings. Binary neural networks (BNNs) have the potential to significantly lower the compute intensity but existing models suffer from low quality. To overcome this deficiency, we propose PokeConv, a binary convolution block which improves quality of BNNs by techniques such as adding multiple residual paths, and tuning the activation function. We apply it to ResNet-50 and optimize ResNet's initial convolutional layer which is hard to binarize. We name the resulting network family PokeBNN. These techniques are chosen to yield favorable improvements in both top-1 accuracy and the network's cost. In order to enable joint optimization of the cost together with accuracy, we define arithmetic computation effort (ACE), a hardware- and energy-inspired cost metric for quantized and binarized networks. We also identify a need to optimize an under-explored hyper-parameter controlling the binarization gradient approximation. We establish a new, strong state-of-the-art (SOTA) on top-1 accuracy together with commonly-used CPU64 cost, ACE cost and network size metrics. ReActNet-Adam, the previous SOTA in BNNs, achieved a 70.5% top-1 accuracy with 7.9 ACE. A small variant of PokeBNN achieves 70.5% top-1 with 2.6 ACE, more than 3x reduction in cost; a larger PokeBNN achieves 75.6% top-1 with 7.8 ACE, more than 5% improvement in accuracy without increasing the cost. PokeBNN implementation in JAX/Flax and reproduction instructions are available in AQT repository: https://github.com/google/aqt
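    For intuition, here is a sketch of an ACE-style cost computation, under the assumption that a multiply-accumulate between an a-bit activation and a b-bit weight is charged a*b effort units (see the paper for the exact definition):

    ```python
    def ace_cost(layers):
        """Estimate an ACE-style cost: each MAC between an a-bit activation
        and a b-bit weight is charged a * b 'effort' units.

        layers: iterable of (num_macs, act_bits, weight_bits) tuples.
        """
        return sum(macs * a_bits * w_bits for macs, a_bits, w_bits in layers)

    # Toy comparison: fully binarizing a layer (1x1-bit MACs) versus an
    # 8-bit quantized version of the same layer.
    conv8 = [(115_000_000, 8, 8)]       # hypothetical 8-bit conv layer
    conv1 = [(115_000_000, 1, 1)]       # the same layer binarized
    print(ace_cost(conv8) / ace_cost(conv1))  # 64x lower ACE when binary
    ```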
    Altruist: Argumentative Explanations through Local Interpretations of Predictive Models. (arXiv:2010.07650v2 [cs.LG] UPDATED)
    Explainable AI is an emerging field providing solutions for acquiring insights into automated systems' rationale. It has been put on the AI map by suggesting ways to tackle key ethical and societal issues. Existing explanation techniques are often not comprehensible to the end user. Lack of evaluation and selection criteria also makes it difficult for the end user to choose the most suitable technique. In this study, we combine logic-based argumentation with Interpretable Machine Learning, introducing a preliminary meta-explanation methodology that identifies the truthful parts of feature importance oriented interpretations. This approach, in addition to being used as a meta-explanation technique, can be used as an evaluation or selection tool for multiple feature importance techniques. Experimentation strongly indicates that an ensemble of multiple interpretation techniques yields considerably more truthful explanations.
    Identifying Critical LMS Features for Predicting At-risk Students. (arXiv:2204.13700v1 [cs.LG])
    Learning management systems (LMSs) have become essential in higher education and play an important role in helping educational institutions to promote student success. Traditionally, LMSs have been used by postsecondary institutions in administration, reporting, and delivery of educational content. In this paper, we present an additional use of LMS by using its data logs to perform data-analytics and identify academically at-risk students. The data-driven insights would allow educational institutions and educators to develop and implement pedagogical interventions targeting academically at-risk students. We used anonymized data logs created by Brightspace LMS during fall 2019, spring 2020, and fall 2020 semesters at our college. Supervised machine learning algorithms were used to predict the final course performance of students, and several algorithms were found to perform well with accuracy above 90%. The SHAP value method was used to assess the relative importance of features used in the predictive models. Unsupervised learning was also used to group students into different clusters based on the similarities in their interaction/involvement with LMS. In both supervised and unsupervised learning, we identified the two most important features (Number_Of_Assignment_Submissions and Content_Completed). More importantly, our study lays a foundation and provides a framework for developing a real-time data analytics metric that may be incorporated into an LMS.
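    A minimal sketch of this kind of pipeline, a tree model plus SHAP importances, on synthetic stand-in data (two feature names borrowed from the abstract, the data itself invented):

    ```python
    import numpy as np
    import shap
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)
    # Hypothetical LMS activity features for 500 students.
    X = rng.poisson(lam=[5, 20, 3], size=(500, 3)).astype(float)
    # Synthetic label: students with few submissions and low content
    # completion are 'at risk' -- a stand-in for the real Brightspace logs.
    y = ((X[:, 0] < 4) & (X[:, 1] < 15)).astype(int)

    model = GradientBoostingClassifier().fit(X, y)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    names = ["Number_Of_Assignment_Submissions", "Content_Completed", "Logins"]
    importance = np.abs(shap_values).mean(axis=0)
    for name, imp in sorted(zip(names, importance), key=lambda t: -t[1]):
        print(f"{name}: {imp:.3f}")
    ```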
    Multimodal Transformer-based Model for Buchwald-Hartwig and Suzuki-Miyaura Reaction Yield Prediction. (arXiv:2204.14062v1 [cs.LG])
    Predicting the yield percentage of a chemical reaction is useful in many aspects such as reducing wet-lab experimentation by giving the priority to the reactions with a high predicted yield. In this work we investigated the use of multiple input types to predict chemical reaction yield. We used the simplified molecular-input line-entry system (SMILES) as well as calculated chemical descriptors as model inputs. The model consists of a pre-trained bidirectional transformer-based encoder (BERT) and a multi-layer perceptron (MLP) with a regression head to predict the yield. We experimented on two high throughput experimentation (HTE) datasets for Buchwald-Hartwig and Suzuki-Miyaura reactions. The experiments show improvements in the prediction on both datasets compared to systems using only SMILES or chemical descriptors as input. We also tested the model's performance on out-of-sample dataset splits of Buchwald-Hartwig and achieved comparable results with the state-of-the-art. In addition to predicting the yield, we demonstrated the model's ability to suggest the optimum (highest yield) reaction conditions. The model was able to suggest conditions that achieve 94% of the optimum reported yields. This demonstrates the model's usefulness for achieving the best results in the wet lab without expensive experimentation.
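    A sketch of the multimodal architecture described above (PyTorch + Hugging Face). The encoder checkpoint, descriptor dimension, and head sizes are placeholders, not the authors' configuration; in practice a SMILES-pretrained encoder would replace the generic BERT checkpoint used here:

    ```python
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class YieldRegressor(nn.Module):
        """SMILES encoded by a pretrained BERT-style model, concatenated
        with precomputed chemical descriptors, fed to an MLP regression head."""
        def __init__(self, encoder_name="bert-base-uncased", n_descriptors=32):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            self.head = nn.Sequential(
                nn.Linear(hidden + n_descriptors, 256), nn.ReLU(),
                nn.Linear(256, 1),  # predicted yield percentage
            )

        def forward(self, input_ids, attention_mask, descriptors):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]  # [CLS] embedding
            return self.head(torch.cat([cls, descriptors], dim=-1)).squeeze(-1)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["CCO.c1ccccc1Br"], return_tensors="pt", padding=True)
    model = YieldRegressor()
    pred = model(batch["input_ids"], batch["attention_mask"], torch.zeros(1, 32))
    ```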
    Few-shot learning for medical text: A systematic review. (arXiv:2204.14081v1 [cs.CL])
    Objective: Few-shot learning (FSL) methods require small numbers of labeled instances for training. As many medical topics have limited annotated textual data in practical settings, FSL-based natural language processing (NLP) methods hold substantial promise. We aimed to conduct a systematic review to explore the state of FSL methods for medical NLP. Materials and Methods: We searched for articles published between January 2016 and August 2021 using PubMed/Medline, Embase, ACL Anthology, and IEEE Xplore Digital Library. To identify the latest relevant methods, we also searched other sources such as preprint servers (e.g., medRxiv) via Google Scholar. We included all articles that involved FSL and any type of medical text. We abstracted articles based on data source(s), aim(s), training set size(s), primary method(s)/approach(es), and evaluation method(s). Results: 31 studies met our inclusion criteria, all published after 2018; 22 (71%) since 2020. Concept extraction/named entity recognition was the most frequently addressed task (13/31; 42%), followed by text classification (10/31; 32%). Twenty-one (68%) studies reconstructed existing datasets to create few-shot scenarios synthetically, and MIMIC-III was the most frequently used dataset (7/31; 23%). Common methods included FSL with attention mechanisms (12/31; 39%), prototypical networks (8/31; 26%), and meta-learning (6/31; 19%). Discussion: Despite the potential for FSL in biomedical NLP, progress has been limited compared to domain-independent FSL. This may be due to the paucity of standardized, public datasets, and the relative underperformance of FSL methods on biomedical topics. Creation and release of specialized datasets for biomedical FSL may aid method development by enabling comparative analyses.
    MarkovGNN: Graph Neural Networks on Markov Diffusion. (arXiv:2202.02470v2 [cs.LG] UPDATED)
    Most real-world networks contain well-defined community structures where nodes are densely connected internally within communities. To learn from these networks, we develop MarkovGNN that captures the formation and evolution of communities directly in different convolutional layers. Unlike most Graph Neural Networks (GNNs) that consider a static graph at every layer, MarkovGNN generates different stochastic matrices using a Markov process and then uses these community-capturing matrices in different layers. MarkovGNN is a general approach that could be used with most existing GNNs. We experimentally show that MarkovGNN outperforms other GNNs for clustering, node classification, and visualization tasks. The source code of MarkovGNN is publicly available at \url{https://github.com/HipGraph/MarkovGNN}.
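    One plausible reading of the matrix-generation step, sketched with NumPy (the sparsification threshold and renormalization are assumptions, not taken from the paper):

    ```python
    import numpy as np

    def markov_diffusion_matrices(adj: np.ndarray, n_layers: int, eps: float = 0.01):
        """Row-normalize the adjacency matrix into a transition matrix P,
        then iterate the Markov process, sparsifying small entries, to
        obtain one community-capturing matrix per GNN layer."""
        P = adj / adj.sum(axis=1, keepdims=True)
        mats, Pk = [], P.copy()
        for _ in range(n_layers):
            Pk = Pk @ P                      # one more diffusion step
            Pk[Pk < eps] = 0.0               # assumed sparsification step
            row_sums = Pk.sum(axis=1, keepdims=True)
            Pk = np.divide(Pk, row_sums, out=np.zeros_like(Pk), where=row_sums > 0)
            mats.append(Pk.copy())
        return mats

    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
    layer_mats = markov_diffusion_matrices(A + np.eye(4), n_layers=3)
    ```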
    GCN-FFNN: A Two-Stream Deep Model for Learning Solution to Partial Differential Equations. (arXiv:2204.13744v1 [cs.LG])
    This paper introduces a novel two-stream deep model based on graph convolutional network (GCN) architecture and feed-forward neural networks (FFNN) for learning the solution of nonlinear partial differential equations (PDEs). The model aims at incorporating both graph and grid input representations using two streams corresponding to GCN and FFNN models, respectively. Each stream layer receives and processes its own input representation. As opposed to FFNN which receives a grid-like structure, the GCN stream layer operates on graph input data where the neighborhood information is incorporated through the adjacency matrix of the graph. In this way, the proposed GCN-FFNN model learns from two types of input representations, i.e. grid and graph data, obtained via the discretization of the PDE domain. The GCN-FFNN model is trained in two phases. In the first phase, the model parameters of each stream are trained separately. Both streams employ the same error function to adjust their parameters by enforcing the models to satisfy the given PDE as well as its initial and boundary conditions on grid or graph collocation (training) data. In the second phase, the learned parameters of two-stream layers are frozen and their learned representation solutions are fed to fully connected layers whose parameters are learned using the previously used error function. The learned GCN-FFNN model is tested on test data located both inside and outside the PDE domain. The obtained numerical results demonstrate the applicability and efficiency of the proposed GCN-FFNN model over individual GCN and FFNN models on 1D-Burgers, 1D-Schrödinger, 2D-Burgers and 2D-Schrödinger equations.
    ROS-X-Habitat: Bridging the ROS Ecosystem with Embodied AI. (arXiv:2109.07703v3 [cs.RO] UPDATED)
    We introduce ROS-X-Habitat, a software interface that bridges the AI Habitat platform for embodied learning-based agents with other robotics resources via ROS. This interface not only offers standardized communication protocols between embodied agents and simulators, but also enables physically realistic and photorealistic simulation that benefits the training and/or testing of vision-based embodied agents. With this interface, roboticists can evaluate their own Habitat RL agents in another ROS-based simulator or use Habitat Sim v2 as the test bed for their own robotic algorithms. Through in silico experiments, we demonstrate that ROS-X-Habitat has minimal impact on the navigation performance and simulation speed of a Habitat RGBD agent; that a standard set of ROS mapping, planning and navigation tools can run in Habitat Sim v2; and that a Habitat agent can run in the standard ROS simulator Gazebo.
    A study of tree-based methods and their combination. (arXiv:2204.13916v1 [stat.ML])
    Tree-based methods are popular machine learning techniques used in various fields. In this work, we review their foundations and a general framework, the importance sampled learning ensemble (ISLE), that accelerates their fitting process. Furthermore, we describe a model combination strategy called adaptive regression by mixing (ARM), which is feasible for tree-based methods via ISLE. Moreover, three modified ISLEs are proposed, and their performance is evaluated on real data sets.
    Randomized Smoothing under Attack: How Good is it in Practice?. (arXiv:2204.14187v1 [cs.CR])
    Randomized smoothing is a recent and celebrated solution to certify the robustness of any classifier. While it indeed provides a theoretical robustness against adversarial attacks, the dimensionality of current classifiers necessarily imposes Monte Carlo approaches for its application in practice. This paper questions the effectiveness of randomized smoothing as a defense against state-of-the-art black-box attacks. This is a novel perspective, as previous research works considered the certification as an unquestionable guarantee. We first formally highlight the mismatch between a theoretical certification and the practice of attacks on classifiers. We then perform attacks on randomized smoothing as a defense. Our main observation is that there is a major mismatch between the settings of randomized smoothing that yield high certified robustness and those that defeat black-box attacks while preserving classifier accuracy.
    Cost Effective MLaaS Federation: A Combinatorial Reinforcement Learning Approach. (arXiv:2204.13971v1 [cs.LG])
    With the advancement of deep learning techniques, major cloud providers and niche machine learning service providers have started to offer their cloud-based machine learning tools, also known as machine learning as a service (MLaaS), to the public. According to our measurement, for the same task, these MLaaSes from different providers have varying performance due to proprietary datasets, models, etc. Federating different MLaaSes together allows us to improve the analytic performance further. However, naively aggregating results from different MLaaSes not only incurs significant monetary cost but also may lead to sub-optimal performance gain due to the introduction of possible false-positive results. In this paper, we propose Armol, a framework to federate the right selection of MLaaS providers to achieve the best possible analytic performance. We first design a word grouping algorithm to unify the output labels across different providers. We then present a deep combinatorial reinforcement learning-based approach to maximize the accuracy while minimizing the cost. The predictions from the selected providers are then aggregated together using carefully chosen ensemble strategies. The real-world trace-driven evaluation further demonstrates that Armol is able to achieve the same accuracy results with $67\%$ less inference cost.
    Topological Data Analysis in Time Series: Temporal Filtration and Application to Single-Cell Genomics. (arXiv:2204.14048v1 [cs.LG])
    The absence of a conventional association between the cell-cell cohabitation and its emergent dynamics into cliques during development has hindered our understanding of how cell populations proliferate, differentiate, and compete, i.e., the cell ecology. With the recent advancement of single-cell RNA-sequencing (RNA-seq), we can potentially describe such a link by constructing network graphs that characterize the similarity of the gene expression profiles of the cell-specific transcriptional programs, and analyzing these graphs systematically using summary statistics informed by algebraic topology. We propose the single-cell topological simplicial analysis (scTSA). Applying this approach to the single-cell gene expression profiles from local networks of cells in different developmental stages with different outcomes reveals a previously unseen topology of cellular ecology. These networks contain an abundance of cliques of single-cell profiles bound into cavities that guide the emergence of more complicated habitation forms. We visualize these ecological patterns with topological simplicial architectures of these networks, compared with the null models. Benchmarked on the single-cell RNA-seq data of zebrafish embryogenesis spanning 38,731 cells, 25 cell types and 12 time steps, our approach highlights gastrulation as the most critical stage, consistent with the consensus in developmental biology. As a nonlinear, model-independent, and unsupervised framework, our approach can also be applied to tracing multi-scale cell lineage, identifying critical stages, or creating pseudo-time series.
    No Task Left Behind: Multi-Task Learning of Knowledge Tracing and Option Tracing for Better Student Assessment. (arXiv:2204.14006v1 [cs.CY])
    Student assessment is one of the most fundamental tasks in the field of AI Education (AIEd). One of the most common approaches to student assessment is Knowledge Tracing (KT), which evaluates a student's knowledge state by predicting whether the student will answer a given question correctly or not. However, in the context of multiple choice (polytomous) questions, conventional KT approaches are limited in that they only consider the binary (dichotomous) correctness label (i.e., correct or incorrect), and disregard the specific option chosen by the student. Meanwhile, Option Tracing (OT) attempts to model a student by predicting which option they will choose for a given question, but overlooks the correctness information. In this paper, we propose Dichotomous-Polytomous Multi-Task Learning (DP-MTL), a multi-task learning framework that combines KT and OT for more precise student assessment. In particular, we show that the KT objective acts as a regularization term for OT in the DP-MTL framework, and propose an appropriate architecture for applying our method on top of existing deep learning-based KT models. We experimentally confirm that DP-MTL significantly improves both KT and OT performances, and also benefits downstream tasks such as Score Prediction (SP).
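    The combined objective can be sketched as follows (PyTorch). The weighting term lam and the head shapes are assumptions; the paper's exact formulation may differ:

    ```python
    import torch
    import torch.nn.functional as F

    def dp_mtl_loss(option_logits, correct_logit, chosen_option, is_correct, lam=0.5):
        """Dichotomous-Polytomous multi-task loss: Option Tracing (which of
        the K options the student picks) plus Knowledge Tracing (correct or
        not), with the KT term acting as a regularizer on OT.

        option_logits: (B, K) logits over answer options.
        correct_logit: (B,) logit for answering correctly.
        chosen_option: (B,) index of the chosen option.
        is_correct:    (B,) float {0,1} correctness label.
        """
        loss_ot = F.cross_entropy(option_logits, chosen_option)
        loss_kt = F.binary_cross_entropy_with_logits(correct_logit, is_correct)
        return loss_ot + lam * loss_kt

    B, K = 32, 4
    loss = dp_mtl_loss(torch.randn(B, K), torch.randn(B),
                       torch.randint(0, K, (B,)), torch.randint(0, 2, (B,)).float())
    ```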
    RoSA: A Robust Self-Aligned Framework for Node-Node Graph Contrastive Learning. (arXiv:2204.13846v1 [cs.LG])
    Graph contrastive learning has made significant progress recently. However, existing works have rarely explored non-aligned node-node contrasting. In this paper, we propose a novel graph contrastive learning method named RoSA that focuses on utilizing non-aligned augmented views for node-level representation learning. First, we leverage the earth mover's distance to model the minimum effort to transform the distribution of one view to the other as our contrastive objective, which does not require alignment between views. Then we introduce adversarial training as an auxiliary method to increase sampling diversity and enhance the robustness of our model. Experimental results show that RoSA outperforms a series of graph contrastive learning frameworks on homophilous, non-homophilous and dynamic graphs, which validates the effectiveness of our work. To the best of our knowledge, RoSA is the first work to focus on the non-aligned node-node graph contrastive learning problem. Our codes are available at: \href{https://github.com/ZhuYun97/RoSA}{\texttt{https://github.com/ZhuYun97/RoSA}}
    Formulating Robustness Against Unforeseen Attacks. (arXiv:2204.13779v1 [cs.LG])
    Existing defenses against adversarial examples such as adversarial training typically assume that the adversary will conform to a specific or known threat model, such as $\ell_p$ perturbations within a fixed budget. In this paper, we focus on the scenario where there is a mismatch in the threat model assumed by the defense during training, and the actual capabilities of the adversary at test time. We ask the question: if the learner trains against a specific "source" threat model, when can we expect robustness to generalize to a stronger unknown "target" threat model during test-time? Our key contribution is to formally define the problem of learning and generalization with an unforeseen adversary, which helps us reason about the increase in adversarial risk from the conventional perspective of a known adversary. Applying our framework, we derive a generalization bound which relates the generalization gap between source and target threat models to variation of the feature extractor, which measures the expected maximum difference between extracted features across a given threat model. Based on our generalization bound, we propose adversarial training with variation regularization (AT-VR) which reduces variation of the feature extractor across the source threat model during training. We empirically demonstrate that AT-VR can lead to improved generalization to unforeseen attacks during test-time compared to standard adversarial training on Gaussian and image datasets.
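    A rough sketch of the variation-regularized objective. For brevity the inner maximization over the source threat model is replaced with random L-inf perturbation pairs, which is a simplification of the paper's procedure:

    ```python
    import torch
    import torch.nn.functional as F

    def at_vr_loss(feature_extractor, classifier, x, y, eps=8 / 255, lam=1.0, n_pert=2):
        """Adversarial-training-style loss plus a variation regularizer that
        penalizes the largest feature-space gap between perturbed copies of
        the input drawn from the source threat model (here: an L-inf ball,
        sampled randomly as a cheap stand-in for inner maximization)."""
        feats = []
        for _ in range(n_pert):
            delta = (torch.rand_like(x) * 2 - 1) * eps
            feats.append(feature_extractor((x + delta).clamp(0, 1)))
        # Variation: max pairwise feature distance across sampled perturbations.
        variation = max(
            (feats[i] - feats[j]).flatten(1).norm(dim=1).mean()
            for i in range(n_pert) for j in range(i + 1, n_pert)
        )
        logits = classifier(feats[0])
        return F.cross_entropy(logits, y) + lam * variation
    ```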
    Making sense of violence risk predictions using clinical notes. (arXiv:2204.13976v1 [cs.LG])
    Violence risk assessment in psychiatric institutions enables interventions to avoid violence incidents. Clinical notes written by practitioners and available in electronic health records (EHR) are valuable resources that are seldom used to their full potential. Previous studies have attempted to assess violence risk in psychiatric patients using such notes, with acceptable performance. However, they do not explain why classification works and how it can be improved. We explore two methods to better understand the quality of a classifier in the context of clinical note analysis: random forests using topic models, and choice of evaluation metric. These methods allow us to understand both our data and our methodology more profoundly, laying the groundwork for improved models that build upon this understanding. This is particularly important when it comes to the generalizability of evaluated classifiers to new data, a trustworthiness problem that is of great interest due to the increased availability of new data in electronic format.
    Backdoor Attacks in Federated Learning by Rare Embeddings and Gradient Ensembling. (arXiv:2204.14017v1 [cs.LG])
    Recent advances in federated learning have demonstrated its promising capability to learn on decentralized datasets. However, a considerable amount of work has raised concerns due to the potential risks of adversaries participating in the framework to poison the global model for an adversarial purpose. This paper investigates the feasibility of model poisoning for backdoor attacks through \textit{rare word embeddings of NLP models} in text classification and sequence-to-sequence tasks. In text classification, less than 1\% of adversary clients suffices to manipulate the model output without any drop in the performance of clean sentences. For a less complex dataset, a mere 0.1\% of adversary clients is enough to poison the global model effectively. We also propose a technique specialized in the federated learning scheme called gradient ensemble, which enhances the backdoor performance in all experimental settings.
    Explainable AI via Learning to Optimize. (arXiv:2204.14174v1 [math.OC])
    Indecipherable black boxes are common in machine learning (ML), but applications increasingly require explainable artificial intelligence (XAI). The core of XAI is to establish transparent and interpretable data-driven algorithms. This work provides concrete tools for XAI in situations where prior knowledge must be encoded and untrustworthy inferences flagged. We use the "learn to optimize" (L2O) methodology wherein each inference solves a data-driven optimization problem. Our L2O models are straightforward to implement, directly encode prior knowledge, and yield theoretical guarantees (e.g. satisfaction of constraints). We also propose use of interpretable certificates to verify whether model inferences are trustworthy. Numerical examples are provided in the applications of dictionary-based signal recovery, CT imaging, and arbitrage trading of cryptoassets.
    Fix the Noise: Disentangling Source Feature for Transfer Learning of StyleGAN. (arXiv:2204.14079v1 [cs.CV])
    Transfer learning of StyleGAN has recently shown great potential to solve diverse tasks, especially in domain translation. Previous methods utilized a source model by swapping or freezing weights during transfer learning; however, they have limitations in visual quality and in controlling source features. In other words, they require additional models that are computationally demanding and have restricted control steps that prevent a smooth transition. In this paper, we propose a new approach to overcome these limitations. Instead of swapping or freezing, we introduce a simple feature matching loss to improve generation quality. In addition, to control the degree of source features, we train a target model with the proposed strategy, FixNoise, to preserve the source features only in a disentangled subspace of a target feature space. Owing to the disentangled feature space, our method can smoothly control the degree of the source features in a single model. Extensive experiments demonstrate that the proposed method can generate more consistent and realistic images than previous works.
    Federated Learning: Balancing the Thin Line Between Data Intelligence and Privacy. (arXiv:2204.13697v1 [cs.LG])
    Federated learning holds great promise in learning from fragmented sensitive data and has revolutionized how machine learning models are trained. This article provides a systematic overview and detailed taxonomy of federated learning. We investigate the existing security challenges in federated learning and provide a comprehensive overview of established defense techniques for data poisoning, inference attacks, and model poisoning attacks. The work also presents an overview of current training challenges for federated learning, focusing on handling non-i.i.d. data, high dimensionality issues, and heterogeneous architecture, and discusses several solutions for the associated challenges. Finally, we discuss the remaining challenges in managing federated learning training and suggest focused research directions to address the open questions. Potential candidate areas for federated learning, including the IoT ecosystem and healthcare applications, are discussed, with a particular focus on banking and financial domains.
    CATNet: Cross-event Attention-based Time-aware Network for Medical Event Prediction. (arXiv:2204.13847v1 [cs.LG])
    Medical event prediction (MEP) is a fundamental task in the medical domain, which needs to predict medical events, including medications, diagnosis codes, laboratory tests, procedures, outcomes, and so on, according to historical medical records. The task is challenging as medical data is a type of complex time series data with heterogeneous and temporally irregular characteristics. Many machine learning methods that consider the two characteristics have been proposed for medical event prediction. However, most of them consider the two characteristics separately and ignore the correlations among different types of medical events, especially relations between historical medical events and target medical events. In this paper, we propose a novel neural network based on attention mechanism, called cross-event attention-based time-aware network (CATNet), for medical event prediction. It is a time-aware, event-aware and task-adaptive method with the following advantages: 1) modeling heterogeneous information and temporal information in a unified way and considering temporally irregular characteristics locally and globally respectively, 2) taking full advantage of correlations among different types of events via cross-event attention. Experiments on two public datasets (MIMIC-III and eICU) show CATNet can be adaptive with different MEP tasks and outperforms other state-of-the-art methods on various MEP tasks. The source code of CATNet will be released after this manuscript is accepted.
    VPNets: Volume-preserving neural networks for learning source-free dynamics. (arXiv:2204.13843v1 [cs.LG])
    We propose volume-preserving networks (VPNets) for learning unknown source-free dynamical systems using trajectory data. We propose three modules and combine them to obtain two network architectures, coined R-VPNet and LA-VPNet. The distinct feature of the proposed models is that they are intrinsically volume-preserving. In addition, the corresponding approximation theorems are proved, which theoretically guarantee the expressivity of the proposed VPNets to learn source-free dynamics. The effectiveness, generalization ability and structure-preserving property of the VPNets are demonstrated by numerical experiments.
    Short-Term Density Forecasting of Low-Voltage Load using Bernstein-Polynomial Normalizing Flows. (arXiv:2204.13939v1 [cs.LG])
    The transition to a fully renewable energy grid requires better forecasting of demand at the low-voltage level to increase efficiency and ensure reliable control. However, high fluctuations and increasing electrification cause huge forecast variability, not reflected in traditional point estimates. Probabilistic load forecasts take future uncertainties into account and thus allow more informed decision-making for the planning and operation of low-carbon energy systems. We propose an approach for flexible conditional density forecasting of short-term load based on Bernstein polynomial normalizing flows, where a neural network controls the parameters of the flow. In an empirical study with 363 smart meter customers, our density predictions compare favorably against Gaussian and Gaussian mixture densities. Also, they outperform a non-parametric approach based on the pinball loss for 24h-ahead load forecasting for two different neural network architectures.
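    The core transform can be sketched as a monotone Bernstein polynomial whose coefficients a neural network would output; enforcing monotonicity via a cumulative softplus is one standard construction, assumed here rather than taken from the paper:

    ```python
    import numpy as np
    from scipy.special import comb

    def bernstein_transform(y: np.ndarray, theta: np.ndarray) -> np.ndarray:
        """Map y in [0, 1] through a Bernstein polynomial with increasing
        coefficients theta, yielding a monotone (invertible) flow transform."""
        M = len(theta) - 1
        k = np.arange(M + 1)
        basis = comb(M, k) * y[:, None] ** k * (1 - y[:, None]) ** (M - k)
        return basis @ theta

    # Enforce monotonicity: strictly increasing coefficients from raw
    # parameters (softplus keeps increments positive, cumsum accumulates).
    raw = np.random.default_rng(0).normal(size=9)
    theta = np.cumsum(np.log1p(np.exp(raw)))
    z = bernstein_transform(np.linspace(0, 1, 5), theta)
    ```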
    Learning from Natural Language Feedback. (arXiv:2204.14146v1 [cs.CL])
    Pretrained language models often do not perform tasks in ways that are in line with our preferences, e.g., generating offensive text or factually incorrect summaries. Recent work approaches the above issue by learning from a simple form of human evaluation: comparisons between pairs of model-generated task outputs. Comparison feedback conveys limited information about human preferences per human evaluation. Here, we propose to learn from natural language feedback, which conveys more information per human evaluation. We learn from language feedback on model outputs using a three-step learning algorithm. First, we condition the language model on the initial output and feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune a language model to maximize the likelihood of the chosen refinement given the input. In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements, finding that only large language models (175B parameters) do so. Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization ability.
    High Dimensional Bayesian Optimization with Kernel Principal Component Analysis. (arXiv:2204.13753v1 [cs.LG])
    Bayesian Optimization (BO) is a surrogate-based global optimization strategy that relies on a Gaussian Process regression (GPR) model to approximate the objective function and an acquisition function to suggest candidate points. It is well-known that BO does not scale well for high-dimensional problems because the GPR model requires substantially more data points to achieve sufficient accuracy and acquisition optimization becomes computationally expensive in high dimensions. Several recent works aim at addressing these issues, e.g., methods that implement online variable selection or conduct the search on a lower-dimensional sub-manifold of the original search space. Advancing our previous work of PCA-BO that learns a linear sub-manifold, this paper proposes a novel kernel PCA-assisted BO (KPCA-BO) algorithm, which embeds a non-linear sub-manifold in the search space and performs BO on this sub-manifold. Intuitively, constructing the GPR model on a lower-dimensional sub-manifold helps improve the modeling accuracy without requiring much more data from the objective function. Also, our approach defines the acquisition function on the lower-dimensional sub-manifold, making the acquisition optimization more manageable. We compare the performance of KPCA-BO to the vanilla BO and PCA-BO on the multi-modal problems of the COCO/BBOB benchmark suite. Empirical results show that KPCA-BO outperforms BO in terms of convergence speed on most test problems, and this benefit becomes more significant when the dimensionality increases. For the 60D functions, KPCA-BO surpasses PCA-BO in many test cases. Moreover, it efficiently reduces the CPU time required to train the GPR model and optimize the acquisition function compared to the vanilla BO.
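    A compact sketch of such a loop on a toy function (scikit-learn). The expected-improvement acquisition over random latent candidates is a simplification of the authors' setup:

    ```python
    import numpy as np
    from scipy.stats import norm
    from sklearn.decomposition import KernelPCA
    from sklearn.gaussian_process import GaussianProcessRegressor

    def objective(x):                       # toy high-dimensional sphere
        return np.sum(x ** 2, axis=-1)

    rng = np.random.default_rng(0)
    X = rng.uniform(-5, 5, size=(20, 40))   # initial design in 40D
    y = objective(X)

    for _ in range(10):
        # Embed the evaluated points into a 2D non-linear sub-manifold.
        kpca = KernelPCA(n_components=2, kernel="rbf",
                         fit_inverse_transform=True, gamma=0.05)
        Z = kpca.fit_transform(X)
        gpr = GaussianProcessRegressor(normalize_y=True).fit(Z, y)

        # Expected-improvement acquisition over random latent candidates.
        cand = rng.uniform(Z.min(0), Z.max(0), size=(256, 2))
        mu, sd = gpr.predict(cand, return_std=True)
        sd = np.maximum(sd, 1e-9)
        imp = y.min() - mu
        ei = imp * norm.cdf(imp / sd) + sd * norm.pdf(imp / sd)
        z_next = cand[np.argmax(ei)]

        # Map back to the original space and evaluate.
        x_next = np.clip(kpca.inverse_transform(z_next[None]), -5, 5)
        X, y = np.vstack([X, x_next]), np.append(y, objective(x_next))
    ```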
    DeepAdversaries: Examining the Robustness of Deep Learning Models for Galaxy Morphology Classification. (arXiv:2112.14299v2 [cs.LG] UPDATED)
    Data processing and analysis pipelines in cosmological survey experiments introduce data perturbations that can significantly degrade the performance of deep learning-based models. Given the increased adoption of supervised deep learning methods for processing and analysis of cosmological survey data, the assessment of data perturbation effects and the development of methods that increase model robustness are increasingly important. In the context of morphological classification of galaxies, we study the effects of perturbations in imaging data. In particular, we examine the consequences of using neural networks when training on baseline data and testing on perturbed data. We consider perturbations associated with two primary sources: 1) increased observational noise as represented by higher levels of Poisson noise and 2) data processing noise incurred by steps such as image compression or telescope errors as represented by one-pixel adversarial attacks. We also test the efficacy of domain adaptation techniques in mitigating the perturbation-driven errors. We use classification accuracy, latent space visualizations, and latent space distance to assess model robustness. Without domain adaptation, we find that pixel-level processing errors easily flip the classification into an incorrect class and that higher observational noise makes the model trained on low-noise data unable to classify galaxy morphologies. On the other hand, we show that training with domain adaptation improves model robustness and mitigates the effects of these perturbations, improving the classification accuracy by 23% on data with higher observational noise. Domain adaptation also increases by a factor of ~2.3 the latent space distance between the baseline and the incorrectly classified one-pixel perturbed image, making the model more robust to inadvertent perturbations.
    An Extensive Data Processing Pipeline for MIMIC-IV. (arXiv:2204.13841v1 [cs.LG])
    An increasing amount of research is being devoted to applying machine learning methods to electronic health record (EHR) data for various clinical tasks. This growing area of research has exposed the limitation of accessibility of EHR datasets for all, as well as the reproducibility of different modeling frameworks. One reason for these limitations is the lack of standardized pre-processing pipelines. MIMIC is a freely available EHR dataset in a raw format that has been used in numerous studies. The absence of standardized pre-processing steps serves as a major barrier to the wider adoption of the dataset. It also leads to different cohorts being used in downstream tasks, limiting the ability to compare the results among similar studies. Contrasting studies also use various distinct performance metrics, which can greatly reduce the ability to compare model results. In this work, we provide an end-to-end, fully customizable pipeline to extract, clean, and pre-process data, and to build and evaluate predictive models on the fourth version of the MIMIC dataset (MIMIC-IV) for ICU and non-ICU-related clinical time-series prediction tasks.
    Particle Swarm Optimization Based Demand Response Using Artificial Neural Network Based Load Prediction. (arXiv:2204.13990v1 [cs.NE])
    In the present study, a Particle Swarm Optimization (PSO) based Demand Response (DR) model, using an Artificial Neural Network (ANN) to predict load, is proposed. The electrical load and climatological data of a residential area in the city of Austin, Texas are used as the inputs of the ANN. The ANN outputs, together with day-ahead price data, are then used to solve the load shifting and cost reduction problem. According to the results, the proposed model can reduce payment costs and peak load.
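    To make the optimization step concrete, here is a minimal PSO sketch for shifting a fixed daily energy budget toward cheap hours; the prices and constraints are invented placeholders, not the study's data:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    prices = rng.uniform(0.05, 0.30, size=24)   # hypothetical day-ahead prices
    total_energy = 12.0                          # kWh that must be scheduled

    def cost(schedules):                         # electricity bill per particle
        return schedules @ prices

    # Particles: 24-hour load schedules, normalized to the required energy.
    n_particles, n_iters = 30, 200
    pos = rng.uniform(0, 1, size=(n_particles, 24))
    pos *= total_energy / pos.sum(axis=1, keepdims=True)
    vel = np.zeros_like(pos)
    pbest, pbest_cost = pos.copy(), cost(pos)
    gbest = pbest[np.argmin(pbest_cost)]

    for _ in range(n_iters):
        r1, r2 = rng.uniform(size=(2, n_particles, 24))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0, None)
        pos *= total_energy / pos.sum(axis=1, keepdims=True)  # keep energy fixed
        c = cost(pos)
        better = c < pbest_cost
        pbest[better], pbest_cost[better] = pos[better], c[better]
        gbest = pbest[np.argmin(pbest_cost)]
    ```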
    Multi-Agent MDP Homomorphic Networks. (arXiv:2110.04495v2 [cs.LG] UPDATED)
    This paper introduces Multi-Agent MDP Homomorphic Networks, a class of networks that allows distributed execution using only local information, yet is able to share experience between global symmetries in the joint state-action space of cooperative multi-agent systems. In cooperative multi-agent systems, complex symmetries arise between different configurations of the agents and their local observations. For example, consider a group of agents navigating: rotating the state globally results in a permutation of the optimal joint policy. Existing work on symmetries in single agent reinforcement learning can only be generalized to the fully centralized setting, because such approaches rely on the global symmetry in the full state-action spaces, and these can result in correspondences across agents. To encode such symmetries while still allowing distributed execution we propose a factorization that decomposes global symmetries into local transformations. Our proposed factorization allows for distributing the computation that enforces global symmetries over local agents and local interactions. We introduce a multi-agent equivariant policy network based on this factorization. We show empirically on symmetric multi-agent problems that globally symmetric distributable policies improve data efficiency compared to non-equivariant baselines.
    Tailored Uncertainty Estimation for Deep Learning Systems. (arXiv:2204.13963v1 [cs.LG])
    Uncertainty estimation bears the potential to make deep learning (DL) systems more reliable. Standard techniques for uncertainty estimation, however, come with specific combinations of strengths and weaknesses, e.g., with respect to estimation quality, generalization abilities and computational complexity. To actually harness the potential of uncertainty quantification, estimators are required whose properties closely match the requirements of a given use case. In this work, we propose a framework that, firstly, structures and shapes these requirements, secondly, guides the selection of a suitable uncertainty estimation method and, thirdly, provides strategies to validate this choice and to uncover structural weaknesses. By contributing tailored uncertainty estimation in this sense, our framework helps to foster trustworthy DL systems. Moreover, it anticipates prospective machine learning regulations that require, e.g., in the EU, evidence of the technical appropriateness of machine learning systems. Our framework provides such evidence for system components modeling uncertainty.
    Hyperbolic Hierarchical Knowledge Graph Embeddings for Link Prediction in Low Dimensions. (arXiv:2204.13704v1 [cs.LG])
    Knowledge graph embeddings (KGE) have been validated as powerful methods for inferring missing links in knowledge graphs (KGs) since they map entities into Euclidean space and treat relations as transformations of entities. Currently, some Euclidean KGE methods model semantic hierarchies prevalent in KGs and improve link prediction performance. For hierarchical data, instead of traditional Euclidean space, hyperbolic space as an embedding space has shown the promise of high fidelity and low memory consumption; however, existing hyperbolic KGE methods neglect to model them. To address this issue, we propose a novel KGE model -- hyperbolic hierarchical KGE (HypHKGE). To be specific, we first design attention-based learnable curvatures for hyperbolic space to preserve rich semantic hierarchies. Moreover, we define hyperbolic hierarchical transformations based on the theory of hyperbolic geometry, which utilize the hierarchies that we preserved to infer the links. Experiments show that HypHKGE can effectively model semantic hierarchies in hyperbolic space and outperforms the state-of-the-art hyperbolic methods, especially in low dimensions.
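    For context, the Poincaré-ball distance that such embeddings build on is sketched below (standard formula; the paper's attention-based learnable curvatures are simplified to a fixed c):

    ```python
    import numpy as np

    def poincare_distance(u: np.ndarray, v: np.ndarray, c: float = 1.0) -> float:
        """Geodesic distance on a Poincare ball of curvature -c.
        Points must satisfy sqrt(c) * ||x|| < 1."""
        sc = np.sqrt(c)
        diff2 = np.sum((u - v) ** 2)
        denom = (1 - c * np.sum(u ** 2)) * (1 - c * np.sum(v ** 2))
        return (1 / sc) * np.arccosh(1 + 2 * c * diff2 / denom)

    # Points near the boundary are far apart despite small Euclidean
    # distance, which is what lets hierarchies embed with low distortion.
    print(poincare_distance(np.array([0.0, 0.0]), np.array([0.9, 0.0])))
    ```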
    An Online Ensemble Learning Model for Detecting Attacks in Wireless Sensor Networks. (arXiv:2204.13814v1 [cs.NI])
    In today's modern world, the usage of technology is unavoidable, and rapid advances in the Internet and communication fields have expanded Wireless Sensor Network (WSN) technology. A huge number of sensing devices collect and/or generate numerous sensory data over time for a wide range of fields and applications. However, WSNs have been proven vulnerable to security breaches; the harsh and unattended deployment of these networks, combined with their constrained resources and the volume of data generated, introduces a major security concern. Because WSN applications are extremely critical, it is essential to build reliable solutions that involve fast and continuous mechanisms for online data stream analysis, enabling the detection of attacks and intrusions. In this context, our aim is to develop an intelligent, efficient, and updatable intrusion detection system by applying an important machine learning concept known as ensemble learning in order to improve detection performance. Although ensemble models have been proven to be useful in offline learning, they have received less attention in streaming applications. In this paper, we examine the application of different homogeneous and heterogeneous online ensembles in sensory data analysis, on a specialized wireless sensor network-detection system (WSN-DS) dataset, in order to classify four types of attacks (Blackhole, Grayhole, Flooding, and Scheduling) among normal network traffic. Among the proposed novel online ensembles, both the heterogeneous ensemble consisting of an Adaptive Random Forest (ARF) combined with the Hoeffding Adaptive Tree (HAT) algorithm and the homogeneous HAT ensemble made up of 10 models achieved higher detection rates of 96.84% and 97.2%, respectively. The above models are efficient and effective in dealing with concept drift, while taking into account the resource constraints of WSNs.
    Bayesian Information Criterion for Event-based Multi-trial Ensemble data. (arXiv:2204.14096v1 [stat.ML])
    Transient recurring phenomena are ubiquitous in many scientific fields like neuroscience and meteorology. Time-inhomogeneous Vector Autoregressive Models (VAR) may be used to characterize peri-event system dynamics associated with such phenomena, and can be learned from multi-dimensional data gathering samples of the system's evolution in multiple time windows, each associated with one occurrence of the transient phenomenon, which we will call a "trial". However, optimal VAR model order selection methods, commonly relying on the Akaike or Bayesian Information Criteria (AIC/BIC), are typically not designed for multi-trial data. Here we derive BIC methods for multi-trial ensemble data which are gathered after the detection of the events. We show using simulated bivariate AR models that the multi-trial BIC is able to recover the real model order. We also demonstrate with simulated transient events and real data that the multi-trial BIC is able to estimate a sufficiently small model order for dynamic system modeling.
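    A sketch of order selection with a pooled multi-trial BIC for a least-squares VAR fit; this follows the usual Gaussian BIC recipe with samples pooled across trials, which is one reading of the abstract rather than its exact derivation:

    ```python
    import numpy as np

    def multitrial_var_bic(trials, max_order=8):
        """trials: list of (T, d) arrays, one per detected event.
        Fits VAR(p) by pooled least squares for each order p and returns
        the BIC curve; the minimizer is the selected model order."""
        d = trials[0].shape[1]
        bics = []
        for p in range(1, max_order + 1):
            X, Y = [], []
            for tr in trials:                    # pool regressors over trials
                for t in range(p, len(tr)):
                    X.append(tr[t - p:t][::-1].ravel())  # most recent lag first
                    Y.append(tr[t])
            X, Y = np.asarray(X), np.asarray(Y)
            B, *_ = np.linalg.lstsq(X, Y, rcond=None)
            resid = Y - X @ B
            n = len(Y)
            sigma2 = np.sum(resid ** 2, axis=0) / n  # per-dimension noise
            k = p * d * d                            # number of VAR parameters
            bics.append(n * np.sum(np.log(sigma2)) + k * np.log(n))
        return np.array(bics)

    rng = np.random.default_rng(0)
    trials = [rng.standard_normal((100, 2)) for _ in range(20)]
    print("selected order:", np.argmin(multitrial_var_bic(trials)) + 1)
    ```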
    Local Explanation of Dimensionality Reduction. (arXiv:2204.14012v1 [cs.LG])
    Dimensionality reduction (DR) is a popular method for preparing and analyzing high-dimensional data. Reduced data representations are less computationally intensive and easier to manage and visualize, while retaining a significant percentage of their original information. Despite these advantages, reduced representations can be difficult or impossible to interpret in most circumstances, especially when the DR approach does not provide further information about which features of the original space led to their construction. This problem is tackled by Interpretable Machine Learning, a subfield of Explainable Artificial Intelligence that addresses the opacity of machine learning models. However, current research on Interpretable Machine Learning has been focused on supervised tasks, leaving unsupervised tasks like dimensionality reduction unexplored. In this paper, we introduce LXDR, a technique capable of providing local interpretations of the output of DR techniques. Experiment results and two LXDR use case examples are presented to evaluate its usefulness.
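    In the spirit of LXDR (not the authors' implementation), a local interpretation can be sketched by fitting a linear surrogate from a point's neighborhood in the original space to its DR coordinates:

    ```python
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA  # stand-in for any DR technique
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import NearestNeighbors

    X = load_iris().data
    Z = PCA(n_components=2).fit_transform(X)

    def local_dr_explanation(X, Z, idx, k=15):
        """Fit a linear map from original features to reduced coordinates
        over the k nearest neighbors of point idx; its coefficients act as
        local feature importances for the DR output."""
        nn = NearestNeighbors(n_neighbors=k).fit(X)
        _, nbrs = nn.kneighbors(X[idx:idx + 1])
        surrogate = LinearRegression().fit(X[nbrs[0]], Z[nbrs[0]])
        return surrogate.coef_          # shape (n_components, n_features)

    print(local_dr_explanation(X, Z, idx=0))
    ```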
    SwiftAgg: Communication-Efficient and Dropout-Resistant Secure Aggregation for Federated Learning with Worst-Case Security Guarantees. (arXiv:2202.04169v2 [cs.IT] UPDATED)
    We propose SwiftAgg, a novel secure aggregation protocol for federated learning systems, where a central server aggregates local models of $N$ distributed users, each of size $L$, trained on their local data, in a privacy-preserving manner. Compared with state-of-the-art secure aggregation protocols, SwiftAgg significantly reduces the communication overheads without any compromise on security. Specifically, in presence of at most $D$ dropout users, SwiftAgg achieves a users-to-server communication load of $(T+1)L$ and a users-to-users communication load of up to $(N-1)(T+D+1)L$, with a worst-case information-theoretic security guarantee, against any subset of up to $T$ semi-honest users who may also collude with the curious server. The key idea of SwiftAgg is to partition the users into groups of size $D+T+1$; then in the first phase, secret sharing and aggregation of the individual models are performed within each group, and in the second phase, model aggregation is performed on $D+T+1$ sequences of users across the groups. If a user in a sequence drops out in the second phase, the rest of the sequence remains silent. This design allows only a subset of users to communicate with each other, and only the users in a single group to directly communicate with the server, eliminating the requirements of 1) an all-to-all communication network across users; and 2) all users communicating with the server, as in other secure aggregation protocols. This helps to substantially slash the communication costs of the system.
    Framework for Behavioral Disorder Detection Using Machine Learning and Application of Virtual Cognitive Behavioral Therapy in COVID-19 Pandemic. (arXiv:2204.13900v1 [cs.LG])
    In this modern world, people are becoming more self-centered and unsocial. At the same time, people are stressed and have become more anxious during the COVID-19 pandemic, exhibiting symptoms of behavioral disorders. To measure the symptoms of a behavioral disorder, psychiatrists usually rely on long sessions and inputs from specific questionnaires. This process is time-consuming and sometimes ineffective at detecting the right behavioral disorder; reserved people may also hesitate to follow it. We have created a digital framework which can detect behavioral disorders and prescribe virtual Cognitive Behavioral Therapy (vCBT) for recovery. Using this framework, people can input the data most indicative of three behavioral disorders, namely depression, anxiety and internet addiction. We have applied machine learning techniques to detect specific behavioral disorders from samples. The system guides the user with basic understanding and treatment through vCBT from anywhere at any time, which would potentially be the steppingstone for the user to become conscious of their condition and pursue the right treatment.
    Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting--Full Version. (arXiv:2204.13767v1 [cs.LG])
    A variety of real-world applications rely on far future information to make decisions, thus calling for efficient and accurate long sequence multivariate time series forecasting. While recent attention-based forecasting models show strong abilities in capturing long-term dependencies, they still suffer from two key limitations. First, canonical self attention has a quadratic complexity w.r.t. the input time series length, thus falling short in efficiency. Second, different variables' time series often have distinct temporal dynamics, which existing studies fail to capture, as they use the same model parameter space, e.g., projection matrices, for all variables' time series, thus falling short in accuracy. To ensure high efficiency and accuracy, we propose Triformer, a triangular, variable-specific attention. (i) Linear complexity: we introduce a novel patch attention with linear complexity. When stacking multiple layers of the patch attentions, a triangular structure is proposed such that the layer sizes shrink exponentially, thus maintaining linear complexity. (ii) Variable-specific parameters: we propose a light-weight method to enable distinct sets of model parameters for different variables' time series to enhance accuracy without compromising efficiency and memory usage. Strong empirical evidence on four datasets from multiple domains justifies our design choices, and it demonstrates that Triformer outperforms state-of-the-art methods w.r.t. both accuracy and efficiency. This is an extended version of "Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting", to appear in IJCAI 2022 [Cirstea et al., 2022a], including additional experimental results.
    Fairer LP-based Online Allocation via Analytic Center. (arXiv:2110.14621v3 [cs.DS] UPDATED)
    In this paper, we consider an online resource allocation problem where a decision maker accepts or rejects incoming customer requests irrevocably in order to maximize expected reward given limited resources. At each time, a new order/customer/bid is revealed with a request of some resource(s) and a reward. We consider a stochastic setting where all the orders are i.i.d. sampled from an unknown distribution. Such formulation arises from many classic applications such as the canonical (quantity-based) network revenue management problem and the Adwords problem. While the literature on the topic mainly focuses on regret minimization, our paper considers the \textit{fairness} aspect of the problem. On a high level, we define the fairness in a way that a fair online algorithm should treat similar agents/customers similarly, and the decision made for similar agents/customers should be consistent over time. To achieve this goal, we define the fair offline solution as the analytic center of the offline optimal solution set, and introduce \textit{cumulative unfairness} as the cumulative deviation from the online solutions to the fair offline solution over time. We propose a fair algorithm based on an interior-point LP solver and a mechanism that dynamically detects unfair resource spending. Our algorithm achieves cumulative unfairness on the scale of order $O(\log(T))$, while maintains the regret to be bounded without dependency on $T$. In addition, compared to the literature, our result is produced under less restrictive assumptions on the degeneracy of the underlying linear program.
    Convergence of gradient descent for deep neural networks. (arXiv:2203.16462v2 [cs.LG] UPDATED)
    Optimization by gradient descent has been one of the main drivers of the "deep learning revolution". Yet, despite some recent progress for extremely wide networks, it remains an open problem to understand why gradient descent often converges to global minima when training deep neural networks. This article presents a new criterion for convergence of gradient descent to a global minimum, which is provably more powerful than the best available criteria from the literature, namely, the Lojasiewicz inequality and its generalizations. This criterion is used to show that gradient descent with proper initialization converges to a global minimum when training any feedforward neural network with smooth and strictly increasing activation functions, provided that the input dimension is greater than or equal to the number of data points.
    Machine Learning-Based GPS Multipath Detection Method Using Dual Antennas. (arXiv:2204.14001v1 [cs.NI])
    In urban areas, global navigation satellite system (GNSS) signals are often reflected or blocked by buildings, thus resulting in large positioning errors. In this study, we propose a machine learning approach for global positioning system (GPS) multipath detection that uses dual antennas. A machine learning model that classifies GPS signal reception conditions was trained on several GPS measurements selected as features. We applied five features for machine learning, including a feature obtained from the dual antennas, and evaluated the classification performance of the model using four machine learning algorithms: gradient boosting decision tree (GBDT), random forest, decision tree, and K-nearest neighbor (KNN). A classification accuracy of 82%-96% was achieved when the test data set was collected at the same locations as those of the training data set. However, when the test data set was collected at locations different from those of the training data, a classification accuracy of 44%-77% was obtained.
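    The classification step is a standard supervised pipeline; a sketch with synthetic stand-ins for the five features might look like the following. The feature semantics and labels below are placeholders, not the paper's data (whose features include a dual-antenna measurement).

        # Sketch of the four-classifier comparison on synthetic stand-in features.
        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 5))        # placeholders for e.g. C/N0, elevation, ...
        y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 0).astype(int)  # LOS vs multipath
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

        for name, clf in [("GBDT", GradientBoostingClassifier()),
                          ("RF", RandomForestClassifier()),
                          ("DT", DecisionTreeClassifier()),
                          ("KNN", KNeighborsClassifier())]:
            clf.fit(X_tr, y_tr)
            print(name, round(accuracy_score(y_te, clf.predict(X_te)), 3))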
    Depth Estimation with Simplified Transformer. (arXiv:2204.13791v1 [cs.CV])
    Transformer and its variants have shown state-of-the-art results in many vision tasks recently, ranging from image classification to dense prediction. Despite their success, limited work has been reported on improving the model efficiency for deployment in latency-critical applications, such as autonomous driving and robotic navigation. In this paper, we aim at improving upon the existing transformers in vision, and propose a method for self-supervised monocular Depth Estimation with Simplified Transformer (DEST), which is efficient and particularly suitable for deployment on GPU-based platforms. Through strategic design choices, our model leads to significant reductions in model size, complexity, as well as inference latency, while achieving superior accuracy compared to the state of the art. We also show that our design generalizes well to other dense prediction tasks without bells and whistles.
    Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks. (arXiv:2111.02278v2 [cs.LG] UPDATED)
    Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points - i.e., points where the tangent of the ReLU network estimator changes - between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.
    Stochastic Video Prediction with Structure and Motion. (arXiv:2203.10528v2 [cs.CV] UPDATED)
    While stochastic video prediction models enable future prediction under uncertainty, they mostly fail to model the complex dynamics of real-world scenes. For example, they cannot provide reliable predictions for scenes with a moving camera and independently moving foreground objects in driving scenarios. The existing methods fail to fully capture the dynamics of the structured world by only focusing on changes in pixels. In this paper, we assume that there is an underlying process creating observations in a video and propose to factorize it into static and dynamic components. We model the static part based on the scene structure and the ego-motion of the vehicle, and the dynamic part based on the remaining motion of the dynamic objects. By learning separate distributions of changes in foreground and background, we can decompose the scene into static and dynamic parts and separately model the change in each. Our experiments demonstrate that disentangling structure and motion helps stochastic video prediction, leading to better future predictions in complex driving scenarios on two real-world driving datasets, KITTI and Cityscapes.
    Feature extraction using Spectral Clustering for Gene Function Prediction using Hierarchical Multi-label Classification. (arXiv:2203.13551v2 [cs.LG] UPDATED)
    Gene annotation addresses the problem of predicting unknown associations between genes and functions (e.g., biological processes) of a specific organism. Despite recent advances, the cost and time demanded by annotation procedures that rely largely on in vivo biological experiments remain prohibitively high. This paper presents a novel in silico approach to the annotation problem that combines cluster analysis and hierarchical multi-label classification (HMC). The approach uses spectral clustering to extract new features from the gene co-expression network (GCN) and enrich the prediction task. HMC is used to build multiple estimators that consider the hierarchical structure of gene functions. The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world. The results illustrate how in silico approaches are key to reducing the time and costs of gene annotation. More specifically, they highlight the importance of: (i) building new features that represent the structure of gene relationships in GCNs to annotate genes; and (ii) taking into account the structure of biological processes to obtain consistent predictions.
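    The feature-extraction step can be sketched as: build a gene co-expression network, compute a spectral embedding of its nodes, and append the embedding to the original features before training the HMC classifiers. The correlation-threshold construction of the GCN below is an assumption for illustration, not the paper's exact pipeline.

        # Sketch: spectral features from a thresholded co-expression network.
        import numpy as np
        from sklearn.manifold import spectral_embedding

        rng = np.random.default_rng(0)
        expr = rng.normal(size=(100, 30))                # 100 genes x 30 conditions
        corr = np.corrcoef(expr)
        adjacency = (np.abs(corr) > 0.3).astype(float)   # simple thresholded GCN (assumption)
        np.fill_diagonal(adjacency, 0.0)

        # Low-dimensional spectral features capturing network structure
        graph_feats = spectral_embedding(adjacency, n_components=8, random_state=0)
        enriched = np.hstack([expr, graph_feats])        # enriched input for the HMC estimators
        print(enriched.shape)                            # (100, 38)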
    Statistical applications of contrastive learning. (arXiv:2204.13999v1 [cs.LG])
    The likelihood function plays a crucial role in statistical inference and experimental design. However, it is computationally intractable for several important classes of statistical models, including energy-based models and simulator-based models. Contrastive learning is an intuitive and computationally feasible alternative to likelihood-based learning. We here first provide an introduction to contrastive learning and then show how we can use it to derive methods for diverse statistical problems, namely parameter estimation for energy-based models, Bayesian inference for simulator-based models, as well as experimental design.
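    For energy-based models, the contrastive idea amounts to noise-contrastive estimation: fit the unnormalized model by logistic discrimination between data and samples from a known noise distribution, learning the normalizer as an extra parameter. A minimal one-dimensional Gaussian sketch (an illustration, not the paper's code) follows.

        # Noise-contrastive estimation for a 1-D Gaussian energy-based model.
        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(0)
        data = rng.normal(loc=2.0, scale=1.0, size=5000)     # observed samples
        noise = rng.normal(loc=0.0, scale=3.0, size=5000)    # samples from known noise

        def log_noise(x):
            return -0.5 * (x / 3.0) ** 2 - np.log(3.0 * np.sqrt(2.0 * np.pi))

        def nce_loss(theta):
            mu, log_z = theta                                # log_z: learned log-normalizer
            def log_model(x):                                # unnormalized Gaussian model
                return -0.5 * (x - mu) ** 2 - log_z
            h_data = log_model(data) - log_noise(data)       # classification logits
            h_noise = log_model(noise) - log_noise(noise)
            # logistic loss: data labeled 1, noise labeled 0
            return np.log1p(np.exp(-h_data)).mean() + np.log1p(np.exp(h_noise)).mean()

        res = minimize(nce_loss, x0=np.array([0.0, 0.0]))
        print("mu, log Z:", res.x)                           # ~2.0 and ~log(sqrt(2*pi)) ~= 0.92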
    Dynamic Diagnosis of the Progress and Shortcomings of Student Learning using Machine Learning based on Cognitive, Social, and Emotional Features. (arXiv:2204.13989v1 [cs.CY])
    Student diversity, in aspects like academic background, learning styles, career and life goals, ethnicity, age, social and emotional characteristics, course load and work schedule, offers unique opportunities in education, like learning new skills, peer mentoring and example setting. But student diversity can also be challenging, as it adds variability in the way in which students learn and progress over time. A single teaching approach is likely to be ineffective and result in students not meeting their potential. Automated support could address the limitations of traditional teaching by continuously assessing student learning and implementing needed interventions. This paper discusses a novel methodology based on data analytics and Machine Learning to measure and causally diagnose the progress and shortcomings of student learning, and then utilizes the insight gained on individuals to optimize learning. Diagnosis pertains to dynamic diagnostic formative assessment, which aims to uncover the causes of learning shortcomings. The methodology groups learning difficulties into four categories: recall from memory, concept adjustment, concept modification, and problem decomposition into sub-goals (sub-problems) and concept combination. Data models predict the occurrence of each of the four challenge types, as well as a student's learning trajectory. The models can be used to automatically create real-time, student-specific interventions (e.g., learning cues) to address less understood concepts. We envision that the system will enable new adaptive pedagogical approaches to unleash student learning potential through customization of the course material to the background, abilities, situation, and progress of each student, and by leveraging diversity-related learning experiences.
    Science Checker: Extractive-Boolean Question Answering For Scientific Fact Checking. (arXiv:2204.12263v2 [cs.CL] UPDATED)
    With the explosive growth of scientific publications, synthesizing scientific knowledge and checking facts become increasingly complex tasks. In this paper, we propose a multi-task approach for verifying scientific questions based on joint reasoning from facts and evidence in research articles. We propose an intelligent combination of (1) an automatic information summarization and (2) a Boolean Question Answering module that generates an answer to a scientific question from only the extracts obtained after summarization. Thus, on a given topic, our proposed approach conducts structured content modeling based on paper abstracts to answer a scientific question while highlighting the texts from papers that discuss the topic. We base our final system on an end-to-end Extractive Question Answering (EQA) model combined with a three-output classification model to perform in-depth semantic understanding of a question and illustrate the aggregation of multiple responses. With our light and fast proposed architecture, we achieve an average error rate of 4% and an F1-score of 95.6%. Our results are supported by experiments with two QA models (BERT, RoBERTa) over 3 million Open Access (OA) articles in the medical and health domains on Europe PMC.
    Toward Degradation-Robust Voice Conversion. (arXiv:2110.07537v3 [eess.AS] UPDATED)
    Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training. Although there are several state-of-the-art any-to-any voice conversion models, they all rely on clean utterances to convert successfully. However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noise or reverberation. It is thus highly desirable to understand how these degradations affect voice conversion and to build a degradation-robust model. We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion. We show that the performance of state-of-the-art models is severely hampered given degraded utterances. We then propose speech enhancement concatenation and denoising training to improve the robustness. In addition to common degradations, we also consider adversarial noises, which alter the model output significantly yet are human-imperceptible. We show that both concatenation with off-the-shelf speech enhancement models and denoising training of voice conversion models improve the robustness, and each has its pros and cons.
    One-Way Matching of Datasets with Low Rank Signals. (arXiv:2204.13858v1 [math.ST])
    We study one-way matching of a pair of datasets with low rank signals. Under a stylized model, we first derive information-theoretic limits of matching. We then show that linear assignment with projected data achieves fast rates of convergence and sometimes even minimax rate optimality for this task. The theoretical error bounds are corroborated by simulated examples. Furthermore, we illustrate practical use of the matching procedure on two single-cell data examples.
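    A sketch of "linear assignment with projected data" under assumed details: project both datasets onto a common low-rank subspace to suppress noise, then match rows by solving a linear assignment problem on pairwise distances. The shared-subspace estimate and squared-distance cost below are illustrative choices, not necessarily the paper's exact procedure.

        # Low-rank projection followed by linear assignment for row matching.
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        rng = np.random.default_rng(0)
        n, d, r = 200, 50, 3
        signal = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # shared low-rank signal
        X = signal + 0.5 * rng.normal(size=(n, d))
        perm = rng.permutation(n)
        Y = signal[perm] + 0.5 * rng.normal(size=(n, d))            # same rows, shuffled

        # Estimate the shared signal subspace and project both datasets into it
        _, _, Vt = np.linalg.svd(np.vstack([X, Y]), full_matrices=False)
        V = Vt[:r].T
        PX, PY = X @ V, Y @ V

        # Match rows by minimum-cost linear assignment on squared distances
        cost = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
        row, col = linear_sum_assignment(cost)
        print("fraction correctly matched:", np.mean(perm[col] == row))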
    Post-hoc Interpretability for Neural NLP: A Survey. (arXiv:2108.04840v4 [cs.CL] UPDATED)
    Neural networks for NLP are becoming increasingly complex and widespread, and there is growing concern about whether these models are responsible to use. Explaining models helps to address safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model is learned and are generally model-agnostic. This survey categorizes how recent post-hoc interpretability methods communicate explanations to humans, discusses each method in depth, and covers how the methods are validated, as validation is a common concern.
    AGIC: Approximate Gradient Inversion Attack on Federated Learning. (arXiv:2204.13784v1 [cs.LG])
    Federated learning is a private-by-design distributed learning paradigm where clients train local models on their own data before a central server aggregates their local updates to compute a global model. Depending on the aggregation method used, the local updates are either the gradients or the weights of local learning models. Recent reconstruction attacks apply a gradient inversion optimization on the gradient update of a single minibatch to reconstruct the private data used by clients during training. As the state-of-the-art reconstruction attacks focus solely on a single update, realistic adversarial scenarios are overlooked, such as observation across multiple updates and updates trained from multiple mini-batches. A few studies consider a more challenging adversarial scenario where only model updates based on multiple mini-batches are observable, and resort to computationally expensive simulation to untangle the underlying samples for each local step. In this paper, we propose AGIC, a novel Approximate Gradient Inversion Attack that efficiently and effectively reconstructs images from either model or gradient updates, and across multiple epochs. In a nutshell, AGIC (i) approximates gradient updates of used training samples from model updates to avoid costly simulation procedures, (ii) leverages gradient/model updates collected from multiple epochs, and (iii) assigns increasing weights to layers with respect to the neural network structure for reconstruction quality. We extensively evaluate AGIC on three datasets, CIFAR-10, CIFAR-100 and ImageNet. Our results show that AGIC increases the peak signal-to-noise ratio (PSNR) by up to 50% compared to two representative state-of-the-art gradient inversion attacks. Furthermore, AGIC is faster than the state-of-the-art simulation-based attack, e.g., it is 5x faster when attacking FedAvg with 8 local steps in between model updates.
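    The core of such attacks is gradient matching: optimize a dummy input so that its gradients match the observed update. A conceptual sketch with per-layer weights that increase toward the output (one of AGIC's ingredients) is given below; the model, the squared-error objective, and the weight schedule are assumptions for illustration, not AGIC itself.

        # Conceptual gradient-matching inversion with increasing layer weights.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
        loss_fn = nn.CrossEntropyLoss()
        x_true, y_true = torch.rand(1, 1, 28, 28), torch.tensor([3])

        # The update observed by the attacker (here: one gradient)
        observed = [g.detach() for g in
                    torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())]

        x_hat = torch.rand(1, 1, 28, 28, requires_grad=True)
        opt = torch.optim.Adam([x_hat], lr=0.1)
        weights = torch.linspace(0.5, 1.5, len(observed))   # later layers weighted more

        for step in range(300):
            opt.zero_grad()
            g_hat = torch.autograd.grad(loss_fn(model(x_hat), y_true),
                                        model.parameters(), create_graph=True)
            loss = sum(w * (gh - go).pow(2).sum() for w, gh, go in zip(weights, g_hat, observed))
            loss.backward()
            opt.step()
        print("reconstruction MSE:", (x_hat - x_true).pow(2).mean().item())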
    Modular Domain Adaptation. (arXiv:2204.14213v1 [cs.CL])
    Off-the-shelf models are widely used by computational social science researchers to measure properties of text, such as sentiment. However, without access to source data it is difficult to account for domain shift, which represents a threat to validity. Here, we treat domain adaptation as a modular process that involves separate model producers and model consumers, and show how they can independently cooperate to facilitate more accurate measurements of text. We introduce two lightweight techniques for this scenario, and demonstrate that they reliably increase out-of-domain accuracy on four multi-domain text classification datasets when used with linear and contextual embedding models. We conclude with recommendations for model producers and consumers, and release models and replication code to accompany this paper.
    Physical Deep Learning with Biologically Plausible Training Method. (arXiv:2204.13991v1 [cs.NE])
    The ever-growing demand for further advances in artificial intelligence has motivated research on unconventional computation based on analog physical devices. While such computation devices mimic brain-inspired analog information processing, learning procedures still rely on methods optimized for digital processing, such as backpropagation. Here, we present physical deep learning by extending a biologically plausible training algorithm called direct feedback alignment. As the proposed method is based on random projection with arbitrary nonlinear activation, we can train a physical neural network without knowledge about the physical system. In addition, we can emulate and accelerate the computation for this training on a simple and scalable physical system. We demonstrate the proof-of-concept using a hierarchically connected optoelectronic recurrent neural network called a deep reservoir computer. By constructing an FPGA-assisted optoelectronic benchtop, we confirmed the potential for accelerated computation with competitive performance on benchmarks. Our results provide practical solutions for the training and acceleration of neuromorphic computation.
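    Direct feedback alignment itself is compact: the output error is projected to each hidden layer through a fixed random matrix instead of the transposed forward weights, so no backward pass through the (possibly physical) system is required. A minimal NumPy sketch, with illustrative shapes and a tanh nonlinearity, follows.

        # Minimal direct feedback alignment (DFA) training loop.
        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_hid, n_out, lr = 20, 64, 3, 0.05
        W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
        W2 = rng.normal(scale=0.1, size=(n_hid, n_out))
        B = rng.normal(scale=0.1, size=(n_out, n_hid))   # fixed random feedback matrix

        X = rng.normal(size=(500, n_in))
        Y = np.eye(n_out)[rng.integers(0, n_out, 500)]   # random one-hot targets

        for epoch in range(200):
            h = np.tanh(X @ W1)
            y = h @ W2
            e = y - Y                                    # output error
            # DFA: project the error with B, not with W2.T as backprop would
            delta_h = (e @ B) * (1 - h ** 2)
            W2 -= lr * h.T @ e / len(X)
            W1 -= lr * X.T @ delta_h / len(X)
        print("final MSE:", float((e ** 2).mean()))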
    Neighbor-Based Optimized Logistic Regression Machine Learning Model For Electric Vehicle Occupancy Detection. (arXiv:2204.13702v1 [cs.LG])
    This paper presents an optimized logistic regression machine learning model that predicts the occupancy of an Electric Vehicle (EV) charging station given the occupancy of neighboring stations. The model was optimized for the time of day. Trained on data from 57 EV charging stations around the University of California San Diego campus, the model achieved an 88.43% average accuracy and 92.23% maximum accuracy in predicting occupancy, outperforming a persistence model benchmark.
    HyperJump: Accelerating HyperBand via Risk Modelling. (arXiv:2108.02479v3 [cs.LG] UPDATED)
    In the literature on hyper-parameter tuning, a number of recent solutions rely on low-fidelity observations (e.g., training with sub-sampled datasets or for short periods of time) to extrapolate good configurations to use when performing full training. Among these, HyperBand is arguably one of the most popular solutions, due to its efficiency and theoretically provable robustness. In this work, we introduce HyperJump, a new approach that builds on HyperBand's robust search strategy and complements it with novel model-based risk analysis techniques that accelerate the search by \textit{jumping} the evaluation of low risk configurations, i.e., configurations that are likely to be discarded by HyperBand. We evaluate HyperJump on a suite of hyper-parameter optimization problems and show that it provides over one order of magnitude speed-ups, both in sequential and parallel deployments, on a variety of deep learning, kernel-based learning, and neural architectural search problems when compared to HyperBand and to several state-of-the-art optimizers.
    A Collection of Quality Diversity Optimization Problems Derived from Hyperparameter Optimization of Machine Learning Models. (arXiv:2204.14061v1 [cs.LG])
    The goal of Quality Diversity Optimization is to generate a collection of diverse yet high-performing solutions to a given problem at hand. Typical benchmark problems are, for example, finding a repertoire of robot arm configurations or a collection of game playing strategies. In this paper, we propose a set of Quality Diversity Optimization problems that tackle hyperparameter optimization of machine learning models - a so far underexplored application of Quality Diversity Optimization. Our benchmark problems involve novel feature functions, such as interpretability or resource usage of models. To allow for fast and efficient benchmarking, we build upon YAHPO Gym, a recently proposed open source benchmarking suite for hyperparameter optimization that makes use of high performing surrogate models and returns these surrogate model predictions instead of evaluating the true expensive black box function. We present results of an initial experimental study comparing different Quality Diversity optimizers on our benchmark problems. Furthermore, we discuss future directions and challenges of Quality Diversity Optimization in the context of hyperparameter optimization.
    Searching for Efficient Neural Architectures for On-Device ML on Edge TPUs. (arXiv:2204.14007v1 [cs.DC])
    On-device ML accelerators are becoming a standard in modern mobile system-on-chips (SoC). Neural architecture search (NAS) comes to the rescue for efficiently utilizing the high compute throughput offered by these accelerators. However, existing NAS frameworks have several practical limitations in scaling to multiple tasks and different target platforms. In this work, we provide a two-pronged approach to this challenge: (i) a NAS-enabling infrastructure that decouples model cost evaluation, search space design, and the NAS algorithm to rapidly target various on-device ML tasks, and (ii) search spaces crafted from group convolution based inverted bottleneck (IBN) variants that provide flexible quality/performance trade-offs on ML accelerators, complementing the existing full and depthwise convolution based IBNs. Using this approach we target a state-of-the-art mobile platform, Google Tensor SoC, and demonstrate neural architectures that improve the quality-performance Pareto frontier for various computer vision (classification, detection, segmentation) as well as natural language processing tasks.
    Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization. (arXiv:2006.00425v3 [math.OC] UPDATED)
    Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations of the improved SGMs. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm can achieve the optimal complexity result $O(\varepsilon^{-3})$ to produce a stochastic $\varepsilon$-stationary solution, if a mean-squared smoothness condition holds. Different from existing optimal methods, PStorm can achieve the ${O}(\varepsilon^{-3})$ result by using only one or $O(1)$ samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or $O(1)$ new observations. In addition, for large-scale machine learning problems, PStorm can generalize better by small-batch training than other optimal methods that require large-batch training and the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network and a sparse convolutional neural network.
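    The single-sample, momentum-based variance-reduced estimator at the heart of methods in this family evaluates each new sample at both the current and previous iterates. A toy sketch on a sparse least-squares problem follows; the step sizes and the $\ell_1$ proximal step are assumptions, and this illustrates the update form rather than reproducing the paper's algorithm.

        # Momentum-based variance-reduced proximal stochastic gradient (STORM-style).
        import numpy as np

        rng = np.random.default_rng(0)
        d, lam, eta, beta = 50, 0.01, 0.05, 0.1
        x_star = np.zeros(d); x_star[:5] = 1.0                   # sparse ground truth

        def sample_grad(x, seed):
            a = np.random.default_rng(seed).normal(size=d)       # one random measurement
            return a * (a @ x - a @ x_star)                      # grad of 0.5*(a@x - a@x*)^2

        def prox_l1(x, t):                                       # proximal map of t*||x||_1
            return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

        x = np.zeros(d)
        v = sample_grad(x, 0)                                    # initial gradient estimator
        for k in range(1, 3000):
            x_new = prox_l1(x - eta * v, eta * lam)              # proximal gradient step
            # momentum-based variance reduction: same sample at x_new and x
            v = sample_grad(x_new, k) + (1 - beta) * (v - sample_grad(x, k))
            x = x_new
        print("distance to truth:", np.linalg.norm(x - x_star))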
    Flamingo: a Visual Language Model for Few-Shot Learning. (arXiv:2204.14198v1 [cs.CV])
    Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.
    MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. (arXiv:2111.12707v3 [cs.CV] UPDATED)
    Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at \url{https://github.com/Vegetebird/MHFormer}.
    Preoperative brain tumor imaging: models and software for segmentation and standardized reporting. (arXiv:2204.14199v1 [eess.IV])
    For patients suffering from a brain tumor, prognosis estimation and treatment decisions are made by a multidisciplinary team based on a set of preoperative MR scans. Currently, the lack of standardized and automatic methods for tumor detection and generation of clinical reports represents a major hurdle. In this study, we investigate glioblastomas, lower grade gliomas, meningiomas, and metastases, through four cohorts of up to 4000 patients. Tumor segmentation models were trained using the AGU-Net architecture with different preprocessing steps and protocols. Segmentation performances were assessed in-depth using a wide range of voxel and patient-wise metrics covering volume, distance, and probabilistic aspects. Finally, two software solutions have been developed, enabling an easy use of the trained models and standardized generation of clinical reports: Raidionics and Raidionics-Slicer. Segmentation performances were quite homogeneous across the four different brain tumor types, with an average true positive Dice ranging between 80% and 90%, patient-wise recall between 88% and 98%, and patient-wise precision around 95%. With our Raidionics software, running on a desktop computer with CPU support, tumor segmentation can be performed in 16 to 54 seconds depending on the dimensions of the MRI volume. For the generation of a standardized clinical report, including the tumor segmentation and features computation, 5 to 15 minutes are necessary. All trained models have been made open-access together with the source code for both software solutions and validation metrics computation. In the future, an automatic classification of the brain tumor type would be necessary to replace manual user input. Finally, the inclusion of post-operative segmentation in both software solutions will be key for generating complete post-operative standardized clinical reports.
    Automatic Machine Learning for Multi-Receiver CNN Technology Classifiers. (arXiv:2204.13819v1 [cs.LG])
    Convolutional Neural Networks (CNNs) are one of the most studied families of deep learning models for signal classification, including modulation, technology, detection, and identification. In this work, we focus on technology classification based on raw I/Q samples collected from multiple synchronized receivers. As an example use case, we study protocol identification of Wi-Fi, LTE-LAA, and 5G NR-U technologies that coexist over the 5 GHz Unlicensed National Information Infrastructure (U-NII) bands. Designing and training accurate CNN classifiers involve significant time and effort that goes into fine-tuning a model's architectural settings and determining the appropriate hyperparameter configurations, such as learning rate and batch size. We tackle the former by defining architectural settings themselves as hyperparameters. We attempt to automatically optimize these architectural parameters, along with other preprocessing (e.g., number of I/Q samples within each classifier input) and learning hyperparameters, by forming a Hyperparameter Optimization (HyperOpt) problem, which we solve in a near-optimal fashion using the Hyperband algorithm. The resulting near-optimal CNN (OCNN) classifier is then used to study classification accuracy for over-the-air (OTA) as well as simulated datasets, considering various SNR values. We show that the number of receivers to construct multi-channel inputs for CNNs should be defined as a preprocessing hyperparameter to be optimized via Hyperband. OTA results reveal that our OCNN classifiers improve classification accuracy by 24.58% compared to manually tuned CNNs. We also study the effect of min-max normalization of I/Q samples within each classifier's input on generalization accuracy over simulated datasets with SNRs other than the training set's SNR and show an average of 108.05% improvement when I/Q samples are normalized.
    A Neural Network-enhanced Reproducing Kernel Particle Method for Modeling Strain Localization. (arXiv:2204.13821v1 [cs.CE])
    Modeling the localized intensive deformation in a damaged solid requires highly refined discretization for accurate prediction, which significantly increases the computational cost. Although adaptive model refinement can be employed for enhanced effectiveness, it is cumbersome for the traditional mesh-based methods to perform while modeling the evolving localizations. In this work, neural network-enhanced reproducing kernel particle method (NN-RKPM) is proposed, where the location, orientation, and shape of the solution transition near a localization is automatically captured by the NN approximation via a block-level neural network optimization. The weights and biases in the blocked parametrization network control the location and orientation of the localization. The designed basic four-kernel NN block is capable of capturing a triple junction or a quadruple junction topological pattern, while more complicated localization topological patterns are captured by the superposition of multiple four-kernel NN blocks. The standard RK approximation is then utilized to approximate the smooth part of the solution, which permits a much coarser discretization than the high-resolution discretization needed to capture sharp solution transitions with the conventional methods. A regularization of the neural network approximation is additionally introduced for discretization-independent material responses. The effectiveness of the proposed NN-RKPM is verified by a series of numerical verifications.
    Industry-academia research collaboration and knowledge co-creation: Patterns and anti-patterns. (arXiv:2204.14180v1 [cs.SE])
    Increasing the impact of software engineering research in the software industry and the society at large has long been a concern of high priority for the software engineering community. The problem of two cultures, research conducted in a vacuum (disconnected from the real world), or misaligned time horizons are just some of the many complex challenges standing in the way of successful industry-academia collaborations. This paper reports on the experience of research collaboration and knowledge co-creation between industry and academia in software engineering as a way to bridge the research-practice collaboration gap. Our experience spans 14 years of collaboration between researchers in software engineering and the European and Norwegian software and IT industry. Using the participant observation and interview methods we have collected and afterwards analyzed an extensive record of qualitative data. Drawing upon the findings made and the experience gained, we provide a set of 14 patterns and 14 anti-patterns for industry-academia collaborations, aimed to support other researchers and practitioners in establishing and running research collaboration projects in software engineering.
    Application of machine learning methods to detect and classify Core images using GAN and texture recognition. (arXiv:2204.14224v1 [cs.CV])
    During exploration campaigns, oil companies rely heavily on drill core samples as they provide valuable geological information that helps them find important oil deposits. Traditional core logging techniques are laborious and subjective. Core imaging, a new technique in the oil industry, is used to supplement analysis by rapidly characterising large quantities of drill cores in a nondestructive and noninvasive manner. In this paper, we present the problem of core detection and classification. The first problem is detecting the cores and segmenting the holes in images by using Faster RCNN and Mask RCNN models, respectively. The second problem is filling the holes in core images by applying the Generative Adversarial Network (GAN) technique, using Contextual Residual Aggregation (CRA) to create high-frequency residuals for the missing contents. Finally, texture recognition models are applied for the classification of core images.
    Task Embedding Temporal Convolution Networks for Transfer Learning Problems in Renewable Power Time-Series Forecast. (arXiv:2204.13908v1 [cs.LG])
    Task embeddings in multi-layer perceptrons for multi-task learning and inductive transfer learning in renewable power forecasts have recently been introduced. In many cases, this approach reduces the forecast error and the required training data. However, it does not take into account the seasonal influences within a day, i.e., the diurnal cycle. Therefore, we extend this idea to temporal convolutional networks to consider those seasonalities. We propose transforming the embedding space, which contains the latent similarities between tasks, through convolution and providing these results to the network's residual block. The proposed architecture improves multi-task power forecasts by up to 25 percent on the EuropeWindFarm and GermanSolarFarm datasets compared to the multi-layer perceptron approach. Based on the same data, we achieve a ten percent improvement for the wind datasets and more than 20 percent in most cases for the solar dataset for inductive transfer learning without catastrophic forgetting. Finally, we are the first to propose zero-shot learning for renewable power forecasts, providing predictions even if no training data is available.
    CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. (arXiv:2204.14217v1 [cs.CV])
    The development of transformer-based text-to-image models is impeded by their slow generation and high complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.
    Escaping Spurious Local Minima of Low-Rank Matrix Factorization Through Convex Lifting. (arXiv:2204.14067v1 [cs.LG])
    This work proposes a rapid global solver for nonconvex low-rank matrix factorization (MF) problems that we name MF-Global. Through convex lifting steps, our method efficiently escapes saddle points and spurious local minima ubiquitous in noisy real-world data, and is guaranteed to always converge to the global optimum. Moreover, the proposed approach adaptively adjusts the rank for the factorization and provably identifies the optimal rank for MF automatically in the course of optimization through tools of manifold identification, and thus it also spends significantly less time on parameter tuning than existing MF methods, which require an exhaustive search for this optimal rank. On the other hand, when compared to methods for solving the lifted convex form only, MF-Global leads to significantly faster convergence and much shorter running time. Experiments on real-world large-scale recommendation system problems confirm that MF-Global can indeed effectively escape spurious local solutions at which existing MF approaches get stuck, and is orders of magnitude faster than state-of-the-art algorithms for the lifted convex form.
    Learned Gradient of a Regularizer for Plug-and-Play Gradient Descent. (arXiv:2204.13940v1 [eess.IV])
    The Plug-and-Play (PnP) framework allows integrating advanced image denoising priors into optimization algorithms, to efficiently solve a variety of image restoration tasks. The Plug-and-Play alternating direction method of multipliers (ADMM) and the Regularization by Denoising (RED) algorithms are two examples of such methods that made a breakthrough in image restoration. However, while the former method only applies to proximal algorithms, it has recently been shown that there exists no regularization that explains the RED algorithm when the denoisers lack Jacobian symmetry, which happens to be the case for most practical denoisers. To the best of our knowledge, there exists no method for training a network that directly represents the gradient of a regularizer, which can be directly used in Plug-and-Play gradient-based algorithms. We show that it is possible to train a denoiser along with a network that corresponds to the gradient of its regularizer. We use this gradient of the regularizer in gradient-based optimization methods and obtain better results compared to other generic Plug-and-Play approaches. We also show that the regularizer can be used as a pre-trained network for unrolled gradient descent. Lastly, we show that the resulting denoiser allows for a quick convergence of the Plug-and-Play ADMM.
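    The resulting scheme is a plain gradient iteration: each step descends the data-fidelity gradient plus the network's output, which plays the role of the regularizer's gradient. The sketch below uses an untrained stand-in network and an assumed deblurring setup purely to show the iteration structure; it is not the paper's trained model.

        # Plug-and-Play gradient descent with a network as the regularizer gradient.
        import torch
        import torch.nn as nn

        g = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1))    # stand-in for the learned gradient

        blur = nn.Conv2d(1, 1, 5, padding=2, bias=False)     # known degradation operator A
        nn.init.constant_(blur.weight, 1.0 / 25)

        x_true = torch.rand(1, 1, 32, 32)
        y = (blur(x_true) + 0.01 * torch.randn(1, 1, 32, 32)).detach()  # degraded observation

        x = y.clone()
        step, lam = 0.5, 0.05
        with torch.no_grad():
            for k in range(100):
                # grad of 0.5*||Ax - y||^2 is A^T(Ax - y); the symmetric averaging
                # kernel makes the conv (approximately) self-adjoint here
                residual = blur(blur(x) - y)
                x = x - step * (residual + lam * g(x))       # PnP gradient step
        print(x.shape)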
    Unsupervised Reinforcement Learning for Transferable Manipulation Skill Discovery. (arXiv:2204.13906v1 [cs.RO])
    Current reinforcement learning (RL) in robotics often experiences difficulty in generalizing to new downstream tasks due to the innate task-specific training paradigm. To alleviate this, unsupervised RL, a framework that pre-trains the agent in a task-agnostic manner without access to the task-specific reward, leverages active exploration for distilling diverse experience into essential skills or reusable knowledge. To exploit such benefits in robotic manipulation as well, we propose an unsupervised method for transferable manipulation skill discovery that ties structured exploration toward interacting behavior and transferable skill learning. It not only enables the agent to learn interaction behavior, the key aspect of robotic manipulation learning, without access to the environment reward, but also to generalize to arbitrary downstream manipulation tasks with the learned task-agnostic skills. Through comparative experiments, we show that our approach achieves the most diverse interacting behavior and significantly improves sample efficiency in downstream tasks, including the extension to multi-object, multitask problems.
    3D Common Corruptions and Data Augmentation. (arXiv:2203.01441v3 [cs.CV] UPDATED)
    We introduce a set of image transformations that can be used as corruptions to evaluate the robustness of models as well as data augmentation mechanisms for training neural networks. The primary distinction of the proposed transformations is that, unlike existing approaches such as Common Corruptions, the geometry of the scene is incorporated in the transformations -- thus leading to corruptions that are more likely to occur in the real world. We also introduce a set of semantic corruptions (e.g. natural object occlusions). We show these transformations are `efficient' (can be computed on-the-fly), `extendable' (can be applied on most image datasets), expose vulnerability of existing models, and can effectively make models more robust when employed as `3D data augmentation' mechanisms. The evaluations on several tasks and datasets suggest incorporating 3D information into benchmarking and training opens up a promising direction for robustness research.
    Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities. (arXiv:2204.13707v1 [cs.LG])
    Multimodal sentiment analysis has been studied under the assumption that all modalities are available. However, such a strong assumption does not always hold in practice, and most multimodal fusion models may fail when partial modalities are missing. Several works have addressed the missing modality problem, but most of them only considered the single modality missing case, and ignored the practically more general cases of multiple modalities missing. To this end, in this paper, we propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of uncertain missing modalities. Specifically, we design a tag encoding module to cover both the single modality and multiple modalities missing cases, so as to guide the network's attention to those missing modalities. Besides, we adopt a new space projection pattern to align common vectors. Then, a Transformer encoder-decoder network is utilized to learn the missing modality features. Finally, the outputs of the Transformer encoder are used for the final sentiment classification. Extensive experiments are conducted on CMU-MOSI and IEMOCAP datasets, showing that our method can achieve significant improvements compared with several baselines.
    Learning cosmology and clustering with cosmic graphs. (arXiv:2204.13713v1 [astro-ph.CO])
    We train deep learning models on thousands of galaxy catalogues from the state-of-the-art hydrodynamic simulations of the CAMELS project to perform regression and inference. We employ Graph Neural Networks (GNNs), architectures designed to work with irregular and sparse data, like the distribution of galaxies in the Universe. We first show that GNNs can learn to compute the power spectrum of galaxy catalogues with a few percent accuracy. We then train GNNs to perform likelihood-free inference at the galaxy-field level. Our models are able to infer the value of $\Omega_{\rm m}$ with a $\sim12\%-13\%$ accuracy just from the positions of $\sim1000$ galaxies in a volume of $(25~h^{-1}{\rm Mpc})^3$ at $z=0$ while accounting for astrophysical uncertainties as modelled in CAMELS. Incorporating information from galaxy properties, such as stellar mass, stellar metallicity, and stellar radius, increases the accuracy to $4\%-8\%$. Our models are built to be translational and rotational invariant, and they can extract information from any scale larger than the minimum distance between two galaxies. However, our models are not completely robust: testing on simulations run with a different subgrid physics than the ones used for training does not yield as accurate results.
    Probabilistic Permutation Graph Search: Black-Box Optimization for Fairness in Ranking. (arXiv:2204.13765v1 [cs.LG])
    There are several measures for fairness in ranking, based on different underlying assumptions and perspectives. Plackett-Luce (PL) optimization with the REINFORCE algorithm can be used for optimizing black-box objective functions over permutations. In particular, it can be used for optimizing fairness measures. However, though effective for queries with a moderate number of repeating sessions, PL optimization has room for improvement for queries with a small number of repeating sessions. In this paper, we present a novel way of representing permutation distributions, based on the notion of permutation graphs. Similar to PL, our distribution representation, called PPG, can be used for black-box optimization of fairness. Different from PL, where pointwise logits are used as the distribution parameters, in PPG pairwise inversion probabilities together with a reference permutation construct the distribution. As such, the reference permutation can be set to the best sampled permutation regarding the objective function, making PPG suitable for both deterministic and stochastic rankings. Our experiments show that PPG, while comparable to PL for larger session repetitions (i.e., stochastic ranking), improves over PL for optimizing fairness metrics for queries with one session (i.e., deterministic ranking). Additionally, when accurate utility estimations are available, e.g., in tabular models, the performance of PPG in fairness optimization is significantly boosted compared to lower quality utility estimations from a learning to rank model, leading to a large performance gap with PL. Finally, the pairwise probabilities make it possible to impose pairwise constraints such as "item $d_1$ should always be ranked higher than item $d_2$." Such constraints can be used to simultaneously optimize the fairness metric and control another objective such as ranking performance.
    Fast Sampling of Diffusion Models with Exponential Integrator. (arXiv:2204.13902v1 [cs.LG])
    The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with far fewer steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU to generate $50k$ images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-the-art sampling performance when the number of score function evaluations~(NFE) is limited, e.g., 3.37 FID and 9.74 Inception score with only 15 NFEs on CIFAR10.
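    The numerical building block is the exponential integrator: for a semilinear ODE $x' = a x + n(x, t)$, the linear part is integrated exactly and the nonlinearity is frozen over the step (exponential Euler), which remains stable at step sizes where plain Euler diverges. The toy stiff ODE below stands in for the diffusion ODE; it illustrates the integrator, not DEIS itself.

        # Exponential Euler vs. plain Euler on a stiff semilinear ODE x' = a*x + n(x, t).
        import numpy as np

        a = -50.0                            # stiff linear coefficient
        n = lambda x, t: np.sin(t)           # nonlinear part (score-model stand-in)

        def euler_step(x, t, h):
            return x + h * (a * x + n(x, t))

        def exp_euler_step(x, t, h):
            phi = (np.exp(a * h) - 1.0) / a  # integrates the frozen nonlinearity exactly
            return np.exp(a * h) * x + phi * n(x, t)

        x_euler = x_expo = 1.0
        h, T = 0.05, 2.0                     # large step: plain Euler is unstable (|1 + a*h| > 1)
        for k in range(int(T / h)):
            t = k * h
            x_euler = euler_step(x_euler, t, h)
            x_expo = exp_euler_step(x_expo, t, h)
        print("euler:", x_euler, " exp-euler:", x_expo)   # Euler blows up, exp-Euler stays sane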
    Who will stay? Using Deep Learning to predict engagement of citizen scientists. (arXiv:2204.14046v1 [cs.LG])
    Citizen science and machine learning should be considered for monitoring the coastal and ocean environment due to the scale of threats posed by climate change and the limited resources to fill knowledge gaps. Using data from the annotation activity of citizen scientists in a Swedish marine project, we constructed Deep Neural Network models to predict forthcoming engagement. We tested the models to identify patterns in annotation engagement. Based on the results, it is possible to predict whether an annotator will remain active in future sessions. Depending on the goals of individual citizen science projects, it may also be necessary to identify either those volunteers who will leave or those who will continue annotating. This can be predicted by varying the threshold for the prediction. The engagement metrics used to construct the models are based on time and activity and can be used to infer latent characteristics of volunteers and predict their task interest based on their activity patterns. They can estimate if volunteers can accomplish a given number of tasks in a certain amount of time, identify early on who is likely to become a top contributor or identify who is likely to quit and provide them with targeted interventions. The novelty of our predictive models lies in the use of Deep Neural Networks and the sequence of volunteer annotations. A limitation of our models is that they do not use embeddings constructed from user profiles as input data, as many recommender systems do. We expect that including user profiles would improve prediction performance.
    Visualization and Optimization Techniques for High Dimensional Parameter Spaces. (arXiv:2204.13812v1 [cs.HC])
    High dimensional parameter space optimization is crucial in many applications. The parameters affecting performance can be both numerical and categorical in their type. The existing techniques of black-box optimization and visual analytics are good at dealing with numerical parameters, but analyzing categorical variables in the context of numerical variables is not well studied. Hence, we propose a novel approach to create an auto-tuning framework for storage systems optimization, combining both direct optimization techniques and visual analytics research. While the optimization algorithm will be the core of the system, visual analytics will provide a guideline with the help of an external agent (expert) to provide crucial hints to narrow down the large search space for the optimization engine. As part of the initial step towards creating an auto-tuning engine for storage systems optimization, we created an Interactive Configuration Explorer \textit{ICE}, which directly addresses the need of analysts to learn how the dependent numerical variable is affected by the parameter settings given multiple optimization objectives. No information is lost as ICE shows the complete distribution and statistics of the dependent variable in context with each categorical variable. Analysts can interactively filter the variables to optimize for certain goals such as achieving a system with maximum performance, low variance, etc. Our system was developed in tight collaboration with a group of systems performance researchers and its final effectiveness was evaluated with expert interviews, a comparative user study, and two case studies. We also discuss our research plan for creating an efficient auto-tuning framework combining black-box optimization and visual analytics for storage systems performance optimization.
    Coupling Deep Imputation with Multitask Learning for Downstream Tasks on Genomics Data. (arXiv:2204.13705v1 [q-bio.GN])
    Genomics data such as RNA gene expression, methylation and micro RNA expression are valuable sources of information for various clinical predictive tasks. For example, predicting survival outcomes, cancer histology type and other patient-related information is possible using not only clinical data but molecular data as well. Moreover, using these data sources together, for example in multitask learning, can boost the performance. However, in practice, there are many missing data points which leads to significantly lower patient numbers when analysing full cases, which in our setting refers to all modalities being present. In this paper we investigate how imputing data with missing values using deep learning coupled with multitask learning can help to reach state-of-the-art performance results using combined genomics modalities, RNA, micro RNA and methylation. We propose a generalised deep imputation method to impute values where a patient has all modalities present except one. Interestingly enough, deep imputation alone outperforms multitask learning alone for the classification and regression tasks across most combinations of modalities. In contrast, when using all modalities for survival prediction we observe that multitask learning alone outperforms deep imputation alone with statistical significance (adjusted p-value 0.03). Thus, both approaches are complementary when optimising performance for downstream predictive tasks.
    Tractable Uncertainty for Structure Learning. (arXiv:2204.14170v1 [cs.LG])
    Bayesian structure learning allows one to capture uncertainty over the causal directed acyclic graph (DAG) responsible for generating given data. In this work, we present Tractable Uncertainty for STructure learning (TRUST), a framework for approximate posterior inference that relies on probabilistic circuits as the representation of our posterior belief. In contrast to sample-based posterior approximations, our representation can capture a much richer space of DAGs, while being able to tractably answer a range of useful inference queries. We empirically show how probabilistic circuits can be used as an augmented representation for structure learning methods, leading to improvement in both the quality of inferred structures and posterior uncertainty. Experimental results also demonstrate the improved representational capacity of TRUST, outperforming competing methods on conditional query answering.
    GenDR: A Generalized Differentiable Renderer. (arXiv:2204.13845v1 [cs.CV])
    In this work, we present and study a generalized family of differentiable renderers. We discuss from scratch which components are necessary for differentiable rendering and formalize the requirements for each component. We instantiate our general differentiable renderer, which generalizes existing differentiable renderers like SoftRas and DIB-R, with an array of different smoothing distributions to cover a large spectrum of reasonable settings. We evaluate an array of differentiable renderer instantiations on the popular ShapeNet 3D reconstruction benchmark and analyze the implications of our results. Surprisingly, the simple uniform distribution yields the best overall results when averaged over 13 classes; in general, however, the optimal choice of distribution heavily depends on the task.
    Energy Minimization for Federated Asynchronous Learning on Battery-Powered Mobile Devices via Application Co-running. (arXiv:2204.13878v1 [cs.DC])
    Energy is an essential, but often forgotten aspect in large-scale federated systems. As most of the research focuses on tackling computational and statistical heterogeneity from the machine learning algorithms, the impact on the mobile system still remains unclear. In this paper, we design and implement an online optimization framework by connecting asynchronous execution of federated training with application co-running to minimize energy consumption on battery-powered mobile devices. From a series of experiments, we find that co-running the training process in the background with foreground applications gives the system a deep energy discount with negligible performance slowdown. Based on these results, we first study an offline problem assuming all the future occurrences of applications are available, and propose a dynamic programming-based algorithm. Then we propose an online algorithm using the Lyapunov framework to explore the solution space via the energy-staleness trade-off. The extensive experiments demonstrate that the online optimization framework can save over 60% energy with 3 times faster convergence speed compared to the previous schemes.
    Noise-reducing attention cross fusion learning transformer for histological image classification of osteosarcoma. (arXiv:2204.13838v1 [eess.IV])
    The degree of malignancy of osteosarcoma and its tendency to metastasize/spread mainly depend on the pathological grade (determined by observing the morphology of the tumor under a microscope). The purpose of this study is to use artificial intelligence to classify osteosarcoma histological images and to assess tumor survival and necrosis, which will help doctors reduce their workload, improve the accuracy of osteosarcoma cancer detection, and make a better prognosis for patients. The study proposes a typical transformer image classification framework that integrates a noise reduction convolutional autoencoder and feature cross fusion learning (NRCA-FCFL) to classify osteosarcoma histological images. The noise reduction convolutional autoencoder effectively denoises histological images of osteosarcoma, resulting in purer images for osteosarcoma classification. Moreover, we introduce feature cross fusion learning, which integrates image patches at two scales, to sufficiently explore their interactions by using additional classification tokens. As a result, a refined fusion feature is generated, which is fed to the residual neural network for label prediction. We conduct extensive experiments to evaluate the performance of the proposed approach. The experimental results demonstrate that our method outperforms the traditional and deep learning approaches on various evaluation metrics, with an accuracy of 99.17%, to support osteosarcoma diagnosis.  ( 2 min )
    CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines. (arXiv:2204.13746v1 [cs.CL])
    Convincing people to get vaccinated against COVID-19 is a key societal challenge in the present times. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have towards these vaccines, such as potential side-effects, ineffectiveness, political factors, and so on. Though there are datasets that broadly classify social media posts into Anti-vax and Pro-Vax labels, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns mentioned in the posts. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled into various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset also provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that this is a very challenging dataset for multi-label explainable classification and tweet summarization, as is evident by the moderate scores achieved by some state-of-the-art models. Our dataset and codes are available at: https://github.com/sohampoddar26/caves-data  ( 2 min )
    An Intriguing Property of Geophysics Inversion. (arXiv:2204.13731v1 [cs.LG])
    Inversion techniques are widely used to reconstruct subsurface physical properties (e.g., velocity, conductivity, and others) from surface-based geophysical measurements (e.g., seismic, electric/magnetic (EM) data). The problems are governed by partial differential equations (PDEs) like the wave or Maxwell's equations. Solving geophysical inversion problems is challenging due to the ill-posedness and high computational cost. To alleviate those issues, recent studies leverage deep neural networks to learn the inversion mappings from geophysical measurements to the geophysical property directly. In this paper, we show that such a mapping can be well modeled by a very shallow (but not wide) network with only five layers. This is achieved based on our new finding of an intriguing property: a near-linear relationship between the input and output, after applying an integral transform in high dimensional space. In particular, when dealing with the inversion from seismic data to subsurface velocity governed by a wave equation, the integral results of velocity with Gaussian kernels are linearly correlated to the integral of seismic data with sine kernels. Furthermore, this property can be easily turned into a light-weight encoder-decoder network for inversion. The encoder contains the integration of seismic data and the linear transformation without the need for fine-tuning. The decoder only consists of a single transformer block to reverse the integral of velocity. Experiments show that this interesting property holds for two geophysics inversion problems over four different datasets. Compared to the much deeper InversionNet [Wu et al., 2019], our method achieves comparable accuracy but consumes significantly fewer parameters.  ( 2 min )
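    To make the reported near-linear relationship concrete, here is a small, self-contained sketch (our construction, not the authors' code): toy 1-D "velocity" profiles are pushed through a stand-in forward model, Gaussian-kernel integrals of the velocity and sine-kernel integrals of the resulting "seismic" trace are computed, and a single linear map is fit between the two feature sets. The kernel counts, widths, and the toy forward model are arbitrary illustrative choices; the real wave-equation forward model is nonlinear, and the paper's finding is that the integral transforms nearly linearize it.

        import numpy as np

        rng = np.random.default_rng(0)

        def gaussian_features(v, n_kernels=16, width=0.05):
            """Integrals of a 1-D profile v against Gaussian kernels."""
            x = np.linspace(0.0, 1.0, v.size)
            centres = np.linspace(0.0, 1.0, n_kernels)
            K = np.exp(-((x[None, :] - centres[:, None]) ** 2) / (2.0 * width**2))
            return K @ v / v.size

        def sine_features(s, n_kernels=16):
            """Integrals of a 1-D trace s against sine kernels."""
            t = np.linspace(0.0, 1.0, s.size)
            freqs = np.arange(1, n_kernels + 1)
            K = np.sin(np.pi * freqs[:, None] * t[None, :])
            return K @ s / s.size

        def forward(v):
            # Toy stand-in for the wave equation: a smoothed derivative.
            return np.convolve(np.gradient(v), np.ones(9) / 9.0, mode="same")

        V = np.cumsum(rng.normal(size=(200, 256)), axis=1)        # 200 "velocity" profiles
        Phi_v = np.stack([gaussian_features(v) for v in V])        # shape (200, 16)
        Phi_s = np.stack([sine_features(forward(v)) for v in V])   # shape (200, 16)

        # Fit one linear map from seismic features to velocity features and
        # measure how much variance it explains.
        W, *_ = np.linalg.lstsq(Phi_s, Phi_v, rcond=None)
        resid = Phi_v - Phi_s @ W
        r2 = 1.0 - resid.var() / Phi_v.var()
        print(f"linear fit R^2 between the two integral transforms: {r2:.3f}")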
    Learning to Split for Automatic Bias Detection. (arXiv:2204.13749v1 [cs.LG])
    Classifiers are biased when trained on biased datasets. As a remedy, we propose Learning to Split (ls), an algorithm for automatic bias detection. Given a dataset with input-label pairs, ls learns to split this dataset so that predictors trained on the training split generalize poorly to the testing split. This performance gap provides a proxy for measuring the degree of bias in the learned features and can therefore be used to reduce biases. Identifying non-generalizable splits is challenging, as we don't have any explicit annotations about how to split. In this work, we show that the prediction correctness of the testing examples can be used as a source of weak supervision: generalization performance will drop if we move examples that are predicted correctly away from the testing split, leaving only those that are mispredicted. We evaluate our approach on Beer Review, Waterbirds, CelebA and MNLI. Empirical results show that ls is able to generate astonishingly challenging splits that correlate with human-identified biases. Moreover, we demonstrate that combining robust learning algorithms (such as group DRO) with splits identified by ls enables automatic de-biasing. Compared with previous state-of-the-art methods, ls substantially improves the worst-group performance (by 23.4% on average) when the source of biases is unknown during training and validation.  ( 2 min )
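    As a loose illustration of the weak-supervision rule (this is our toy sketch, not the authors' implementation, which learns a neural splitter): start from a random split, train a predictor on the train side, then keep only the mispredicted examples in the test split and repeat. On data with a spurious cue, the resulting split exhibits a clear generalization gap.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)

        # Toy biased data: feature 0 truly carries the label; feature 1 is a
        # spurious cue that agrees with the label 90% of the time.
        n = 2000
        y = rng.integers(0, 2, n)
        spurious = np.where(rng.random(n) < 0.9, y, 1 - y)
        X = np.column_stack([y + 0.7 * rng.normal(size=n),
                             spurious + 0.1 * rng.normal(size=n)])

        test_mask = rng.random(n) < 0.5          # start from a random split
        for _ in range(10):
            clf = LogisticRegression().fit(X[~test_mask], y[~test_mask])
            correct = clf.predict(X) == y
            new_test = ~correct                   # mispredicted examples stay in test
            if new_test.sum() < 20:               # keep the split non-degenerate
                break
            test_mask = new_test

        # Retrain on the final split and measure the train/test gap it induces.
        clf = LogisticRegression().fit(X[~test_mask], y[~test_mask])
        gap = clf.score(X[~test_mask], y[~test_mask]) - clf.score(X[test_mask], y[test_mask])
        print(f"generalization gap across the learned split: {gap:.2f}")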
    Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations. (arXiv:2204.13853v1 [cs.CL])
    Although deep neural networks have achieved state-of-the-art performance in various machine learning tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks in natural language tasks have boomed in the last five years using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, like adversarial training, there is, to our knowledge, no effective general reactive approach to defence via detection of textual adversarial examples such as is found in the image processing literature. In this paper, we propose two new reactive methods for NLP to fill this gap, which, unlike the few limited-application baselines from NLP, are based entirely on distributional characteristics of learned representations: we adapt one from the image processing literature (Local Intrinsic Dimensionality (LID)), and propose a novel one (MultiDistance Representation Ensemble Method (MDRE)). Adapted LID and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset, as well as on the latter two with respect to the MultiNLI dataset. For future research, we publish our code.  ( 2 min )
    Leveraging triplet loss and nonlinear dimensionality reduction for on-the-fly channel charting. (arXiv:2204.13996v1 [cs.NI])
    Channel charting is an unsupervised learning method that aims at mapping wireless channels to a so-called chart, preserving spatial neighborhoods as much as possible. In this paper, a model-based deep learning approach to this problem is proposed. It builds on a physically motivated distance measure to structure and initialize a neural network that is subsequently trained using a triplet loss function. The proposed structure exhibits a low number of parameters, and the clever initialization leads to fast training. These two features make the proposed approach amenable to on-the-fly channel charting. The method is empirically assessed on realistic synthetic channels, yielding encouraging results.  ( 2 min )
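    For readers unfamiliar with triplet training, the sketch below (our simplification, not the paper's model or its physically motivated initialization) shows the core loop: a small network maps synthetic channel features to 2-D chart coordinates, and triplets exploit the common assumption that samples collected close in time are likely close in space.

        import torch
        import torch.nn as nn

        torch.manual_seed(0)

        # Synthetic "channel features": a slowly drifting trajectory plus noise.
        T, dim = 1000, 32
        features = torch.randn(T, dim).cumsum(dim=0) * 0.05 + torch.randn(T, dim) * 0.1

        net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2))
        loss_fn = nn.TripletMarginLoss(margin=1.0)
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)

        for step in range(500):
            a = torch.randint(0, T, (128,))                           # anchors
            p = (a + torch.randint(1, 5, (128,))).clamp(max=T - 1)    # near in time
            n = torch.randint(0, T, (128,))                           # random, likely far
            loss = loss_fn(net(features[a]), net(features[p]), net(features[n]))
            opt.zero_grad(); loss.backward(); opt.step()

        chart = net(features).detach()    # 2-D chart coordinates for every sample
        print(chart.shape, float(loss))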
    A Mixed-Domain Self-Attention Network for Multilabel Cardiac Irregularity Classification Using Reduced-Lead Electrocardiogram. (arXiv:2204.13917v1 [cs.LG])
    Electrocardiogram (ECG) is commonly used to detect cardiac irregularities such as atrial fibrillation, bradycardia, and other irregular complexes. While previous studies have achieved great accomplishments classifying these irregularities with standard 12-lead ECGs, there is limited evidence demonstrating the utility of reduced-lead ECGs in capturing a wide range of diagnostic information. In addition, the generalizability of classification models across multiple recording sources also remains unexplored. As part of the PhysioNet Computing in Cardiology Challenge 2021, our team, HaoWan AIeC, proposed the Mixed-Domain Self-Attention Resnet (MDARsn) to identify cardiac abnormalities from reduced-lead ECG. Our classifiers received scores of 0.602, 0.593, 0.597, 0.591, and 0.589 (ranked 54th, 37th, 38th, 38th, and 39th) for the 12-lead, 6-lead, 4-lead, 3-lead, and 2-lead versions of the hidden validation set with the evaluation metric defined by the challenge.  ( 2 min )
    Biologically-inspired neuronal adaptation improves learning in neural networks. (arXiv:2204.14008v1 [cs.NE])
    Since humans still outperform artificial neural networks on many tasks, drawing inspiration from the brain may help to improve current machine learning algorithms. Contrastive Hebbian Learning (CHL) and Equilibrium Propagation (EP) are biologically plausible algorithms that update weights using only local information (without explicitly calculating gradients) and still achieve performance comparable to conventional backpropagation. In this study, we augmented CHL and EP with Adjusted Adaptation, inspired by the adaptation effect observed in neurons, in which a neuron's response to a given stimulus is adjusted after a short time. We add this adaptation feature to multilayer perceptrons and convolutional neural networks trained on MNIST and CIFAR-10. Surprisingly, adaptation improved the performance of these networks. We discuss the biological inspiration for this idea and investigate why Neuronal Adaptation could be an important brain mechanism to improve the stability and accuracy of learning.  ( 2 min )
    Probabilistic Models for Manufacturing Lead Times. (arXiv:2204.13792v1 [cs.LG])
    In this study, we utilize Gaussian processes, a probabilistic neural network, natural gradient boosting, and quantile-regression-augmented gradient boosting to model lead times of laser manufacturing processes. We introduce probabilistic modelling in the domain and compare the models in terms of predictive accuracy and calibration. Besides providing a comparison between the models on real-life data, our work has many use cases and substantial business value. Our results indicate that all of the models beat the company estimation benchmark, which uses domain experience, and have good calibration with the empirical frequencies.  ( 2 min )
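    The quantile-regression-augmented gradient boosting ingredient is easy to reproduce with scikit-learn, whose GradientBoostingRegressor supports a quantile loss; here is a minimal sketch on synthetic stand-in data (the real lead-time features are not public, so the data generator below is invented for illustration).

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(0)

        # Synthetic "lead time" data: one informative feature plus skewed noise.
        X = rng.uniform(0, 10, size=(500, 3))
        lead_time = 2.0 * X[:, 0] + rng.gamma(shape=2.0, scale=1.0 + 0.3 * X[:, 1])

        # One boosted model per quantile gives calibrated prediction intervals.
        models = {q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                               n_estimators=200).fit(X, lead_time)
                  for q in (0.1, 0.5, 0.9)}

        x_new = X[:5]
        lo, med, hi = (models[q].predict(x_new) for q in (0.1, 0.5, 0.9))
        for l, m, h in zip(lo, med, hi):
            print(f"predicted lead time {m:5.1f}  (80% interval {l:5.1f} to {h:5.1f})")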
    BEINIT: Avoiding Barren Plateaus in Variational Quantum Algorithms. (arXiv:2204.13751v1 [quant-ph])
    Barren plateaus are a notorious problem in the optimization of variational quantum algorithms and pose a critical obstacle in the quest for more efficient quantum machine learning algorithms. Many potential causes of barren plateaus have been identified, but few solutions have been proposed to avoid them in practice. Existing solutions mainly focus on the initialization of unitary gate parameters without taking into account the changes induced by input data. In this paper, we propose an alternative strategy that initializes the parameters of a unitary gate by drawing from a beta distribution, with the hyperparameters of the beta distribution estimated from the data. To further prevent barren plateaus during training, we add a novel perturbation at every gradient descent step. Taking these ideas together, we empirically show that our proposed framework significantly reduces the possibility of a complex quantum neural network getting stuck in a barren plateau.  ( 2 min )
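    A minimal sketch of the initialization idea, with names, the method-of-moments estimator, and the perturbation schedule chosen by us for illustration (the paper's exact estimator and schedule may differ):

        import numpy as np

        rng = np.random.default_rng(0)

        def beta_moments(x):
            """Fit Beta(a, b) hyperparameters to data by the method of moments."""
            x = (x - x.min()) / (x.max() - x.min() + 1e-12)   # squash data into (0, 1)
            m, v = x.mean(), x.var() + 1e-12
            common = m * (1 - m) / v - 1.0
            return max(m * common, 1e-3), max((1 - m) * common, 1e-3)

        data = rng.normal(size=1024)                  # stand-in for the training inputs
        a, b = beta_moments(data)
        theta = rng.beta(a, b, size=20) * 2 * np.pi   # initial angles for 20 gates

        def sgd_step(theta, grad, lr=0.05, noise=1e-2):
            # Gradient step plus the anti-plateau perturbation at every update.
            return theta - lr * grad + rng.normal(scale=noise, size=theta.shape)

        grad = np.cos(theta)                          # gradient of a toy loss sin(theta)
        theta = sgd_step(theta, grad)
        print(f"Beta({a:.2f}, {b:.2f}) init; first three angles: {theta[:3]}")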

  • Open

    [D] How would you update "A Super Harsh Guide to Machine Learning"?
    Hey, So, I still see A Super Harsh Guide to Machine Learning get mentioned when people give advice for those new to the field. First, read fucking Hastie, Tibshirani, and whoever. Chapters 1-4 and 7-8. If you don't understand it, keep reading it until you do. You can read the rest of the book if you want. You probably should, but I'll assume you know all of it. Take Andrew Ng's Coursera. Do all the exercises in python and R. Make sure you get the same answers with all of them. Now forget all of that and read the deep learning book. Put tensorflow and pytorch on a Linux box and run examples until you get it. Do stuff with CNNs and RNNs and just feed forward NNs. Once you do all of that, go on arXiv and read the most recent useful papers. The literature changes every few months, so keep up. There. Now you can probably be hired most places. If you need resume filler, do some Kaggle competitions. If you have debugging questions, use StackOverflow. If you have math questions, read more. If you have life questions, I have no idea. However, the post is 5 years old and honestly seems out of date (with the Andrew Ng Coursera stuff, for example). How would you update it? submitted by /u/Soft-Ear-6905 [link] [comments]  ( 2 min )
    [P][N] Using Python in HTML - New project by Anaconda
    Peter Wang, the co-founder and CEO of Anaconda, shared at PyCon US a new open source project called PyScript. The project's goal is to enable using Python in HTML files! This is a game-changer for Python dev in general and ML practitioners in particular. It unlocks a world of opportunities and shareability. Peter had a live coding session (respect!) and showed some of PyScript's capabilities. He started with a basic "hello world" example, or better yet "hello PyCon", and very quickly moved to show more advanced applications running on the browser, written in Python and wrapped in HTML! The first app was a Super Mario game where he controlled the player with hand gestures using computer vision packages written in Python. The second one was an interactive dashboard of taxi travels in Manhattan…  ( 2 min )
    [D] Meaningful discussions
    One of the reasons I left academia was the sense that I rarely actually had any meaningful discussions about research that interested me. I published, gave talks, went to conferences, went to workshops, tried to engage smart, important people... It was pretty common to get "interesting work", "nice job", "have you thought about using it for this problem..." or to get superficial citations. But it was extremely rare to find someone who would actually think together about a topic, care enough to constructively criticize, delve into the details together, share in the question and research. My question is for those that have found it at different times: what was the context? What did you do to find it? What did you do with it? Did you figure out how to nurture it? Is it easier online? submitted by /u/ChinCoin [link] [comments]
    [P] music2viz: Conditioning Latent Diffusion Models on Audio Windows (proof of concept)
    submitted by /u/DoeL [link] [comments]  ( 1 min )
    [R][P] Self-Distilled StyleGAN: Towards Generation from Internet Photos + Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    [P] How Forte Transforms the Building of ML Solutions with PyTorch into Assembly Lines
    submitted by /u/julie_ai [link] [comments]
  • Open

    Introducing Kohonen Networks (Self-Organizing Maps) for beginners
    I would like to share with you a tutorial that I recently made to explain, in a very practical, introductory and visual way, what Kohonen Neural Networks (Self-Organizing Maps) are. I explain, step by step, and through animations and C code, how to implement this well-known unsupervised learning algorithm to classify and detect patterns in large volumes of data. I hope it is of interest, especially for those developers who are just starting out in this area. Warm greetings! (Subtitles in English, Spanish and Catalan.) https://youtu.be/UawpUKlFzRs submitted by /u/anadalg [link] [comments]  ( 1 min )
    Artificial Nightmares: FEAR || Clip Guided Diffusion AI Art Video [4K 20 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Iterative to Launch Open Source Tool, First to Train ML Models on Any Cloud Using Terraform Solution
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Aphex Twin's 'Vordhosbn' continued by OpenAI Jukebox (over a dozen different AI generated samples)
    submitted by /u/gloriousapplecart [link] [comments]
    OpenAI's DALL-E 2 still has a few problems with concepts - and can't count
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    Green concrete: Meta is using AI to reduce concrete’s carbon footprint
    submitted by /u/qptbook [link] [comments]
    AI Dream 39 - Trippy Fractal Maze 4K
    submitted by /u/LordPewPew777 [link] [comments]
    If I wanted to start making AI, how could I do this?
    I would like to try. submitted by /u/Privatepizza08 [link] [comments]  ( 4 min )
  • Open

    How Power BI Applications Are Reshaping The Healthcare Industry
    Dealing with a whopping amount of data is normal for businesses in any sector these days. Without using this information obtained from various sources, these entities find it hard to analyze various factors and make strategic decisions. The same can be said about the healthcare sector. Especially after the Covid-19 pandemic, the clinics and medical…  ( 5 min )
    Hyperloop Technology- Advancing into the Future
    The idea of Hyperloop technology was put forth by Elon Musk, who in 2013 made it open source through a white paper. Hyperloop technology revolves around building an ultra-high-speed ground transportation system. In a hyperloop system, special vacuum tubes are built over or underground in which pods can travel at high speed. Such systems can…  ( 2 min )
  • Open

    Open AI GYM Lunar Lander DQN Reinforcement Learning Algorithm Performance After 100 000 Timesteps
    submitted by /u/elonmusk12345_ [link] [comments]  ( 1 min )
    Adversarial Attacks using Reinforcement Learning
    Hi All, I have been looking a bit into adversarial attacks on NNs recently, particularly on the NLP side of things. I also came across some papers about adversarial attacks on RL algorithms. But these were about attacks ON reinforcement learning algorithms and not USING them to attack something else. I tried to find some papers on RL agents as attackers but came up empty. Intuitively, I do realize that designing an adversarial example generator as an RL env is tricky, even more so for NLP. Do you think this is a feasible research direction? Also, any related papers to get myself on the starting line would be super helpful. Thanks in advance. submitted by /u/abyaadrafid [link] [comments]  ( 1 min )
    Is it possible to modify the reward function during training of an agent using OpenAI/Stable-Baselines3?
    I am currently implementing an idea where I want the agent to get a large reward for objective A at the start of training, but as the agent learns and gets more mature, I want the reward for this objective to reduce slightly. Is this sort of thing easy to implement? Is it possible? Any help on this would be great :) Thanks submitted by /u/C_BearHill [link] [comments]  ( 1 min )
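    One low-friction way to do this is to keep the annealing logic in an environment wrapper rather than in the algorithm, since Stable-Baselines3 treats the environment as a black box. Below is a minimal sketch, assuming the classic Gym API where step returns four values and assuming the env exposes the objective-A reward term in info["objective_a"]; that key, and MyEnv in the usage comment, are conventions of this sketch, not of Gym or SB3.

        import gym

        class AnnealedRewardWrapper(gym.Wrapper):
            """Scale down the reward for one objective as training progresses."""

            def __init__(self, env, start_scale=10.0, end_scale=1.0, anneal_steps=100_000):
                super().__init__(env)
                self.start_scale, self.end_scale = start_scale, end_scale
                self.anneal_steps, self.t = anneal_steps, 0

            def step(self, action):
                obs, reward, done, info = self.env.step(action)
                frac = min(self.t / self.anneal_steps, 1.0)
                scale = self.start_scale + frac * (self.end_scale - self.start_scale)
                # Re-weight only the objective-A component of the reward.
                reward += (scale - 1.0) * info.get("objective_a", 0.0)
                self.t += 1
                return obs, reward, done, info

        # Usage with Stable-Baselines3 (MyEnv is a placeholder for your env):
        #   from stable_baselines3 import PPO
        #   model = PPO("MlpPolicy", AnnealedRewardWrapper(MyEnv()))
        #   model.learn(total_timesteps=200_000)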
    Question about curriculum learning
    Hi, this so-called curriculum learning sounds very interesting. But how would the practical usage of this technique look? Assuming the goal task is "grasping an apple", I would divide this task into two subtasks: 1) "How to approach an apple" 2) "How to grasp an object". Then, I would first train the agent on the first subtask until the reward exceeds a threshold. The trained "how_to_approach_to_an_object.pth" would then be used to initialize the training for the second task. Is this the right approach? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
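    That is the standard warm-start pattern; a minimal PyTorch sketch of it follows (the tasks are stubbed with random data here, and a real setup would run the RL algorithm of choice in place of the toy train loop):

        import torch
        import torch.nn as nn

        # Same architecture for both stages of the curriculum.
        policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 3))

        def train(policy, make_batch, steps=200):
            opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
            for _ in range(steps):
                obs, target = make_batch()
                loss = ((policy(obs) - target) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()

        # Stage 1: the easier "approach" subtask (random stand-in data here).
        train(policy, lambda: (torch.randn(64, 8), torch.randn(64, 3)))
        torch.save(policy.state_dict(), "how_to_approach_to_an_object.pth")

        # Stage 2: warm-start from the stage-1 weights, then train the "grasp" task.
        policy.load_state_dict(torch.load("how_to_approach_to_an_object.pth"))
        train(policy, lambda: (torch.randn(64, 8), torch.randn(64, 3)))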
    How to structure Tile Coding input
    Let's say I have a model that wants to encode the observation space using tile coding. My observation space is a football (soccer) game, thus I was thinking of having 3 different tile codings: one for the player position, one for the teammates, and one for the opponents. Each one would encode the position and general direction of each category of player. What is the best way to then feed this data into an RL model? Let's say I split the pitch into 16 tiles, and I am using 4 grids, thus I would have an array of length 64. Since I wish to encode multiple inputs, should I just append 3 different length-64 arrays together, or is there a more efficient way to represent multiple tile encodings, or is my entire proposition wrong altogether? Thanks submitted by /u/uom_questions [link] [comments]  ( 1 min )
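    Concatenating the three codings is a reasonable and common choice; the sketch below (our construction, with pitch size and grid offsets picked arbitrarily) encodes one position per category with 4 offset grids of 16 tiles each and stacks them into a single length-192 binary observation:

        import numpy as np

        def tile_code(pos, n_grids=4, tiles_per_side=4, pitch=(105.0, 68.0)):
            """One-hot tile coding of a 2-D position with several offset grids.

            Returns a binary vector of length n_grids * tiles_per_side**2 (= 64 here).
            """
            w, h = pitch
            out = np.zeros(n_grids * tiles_per_side**2)
            for g in range(n_grids):
                # Each grid is shifted by a fraction of a tile for finer resolution.
                off = g / n_grids
                col = int(np.clip(pos[0] / w * tiles_per_side + off, 0, tiles_per_side - 1e-9))
                row = int(np.clip(pos[1] / h * tiles_per_side + off, 0, tiles_per_side - 1e-9))
                out[g * tiles_per_side**2 + row * tiles_per_side + col] = 1.0
            return out

        player = tile_code(np.array([30.0, 20.0]))
        teammates = tile_code(np.array([50.0, 34.0]))   # in practice, sum/max over players
        opponents = tile_code(np.array([70.0, 40.0]))
        obs = np.concatenate([player, teammates, opponents])
        print(obs.shape)    # (192,)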
  • Open

    Introducing Kohonen Networks (Self-Organizing Maps) for beginners
    Hi team 👋! I would like to share with you a tutorial that I recently made to explain, in a very practical, introductory and visual way, what Kohonen Neural Networks (Self-Organizing Maps) are 🧠 I explain, step by step, and through animations and C code, how to implement this well-known unsupervised learning algorithm to classify and detect patterns in large volumes of data. I hope it is of interest, especially for those developers who are just starting out in this area. Warm greetings! (Subtitles in English, Spanish and Catalan.) https://youtu.be/UawpUKlFzRs submitted by /u/anadalg [link] [comments]  ( 1 min )

  • Open

    osu!
    Was thinking of coding an rl agent that learns to play osu, but stuck on wrapping the program in a gym environment. If anyone is interested and could help me out please dm me submitted by /u/apple-soda-ds [link] [comments]
    Are there any other high-quality pre-built Python training environments other than Open AI's GYM?
    submitted by /u/elonmusk12345_ [link] [comments]  ( 1 min )
    Shortcomings of Robotics Simulation Environments and Tools
    submitted by /u/probznotarobot [link] [comments]  ( 1 min )
    Seeking advice in designing reward function
    Hi all, I am trying to introduce reinforcement learning to myself by designing simple learning scenarios. As you can see below, I am currently working with a simple 3-degree-of-freedom robot. The task that I gave the robot to explore is to reach the sphere with its end-effector. In that case, the cost function is pretty simple: reward_function = d. Now, I would like to make the task a bit more complex by saying: "First, approach the goal just by using q1, and then use q2 and q3, if any distance remains." I am not sure how to formulate this sequential movement of q1 and then q2, q3 as a reward function... any advice? https://preview.redd.it/klwne9fthpw81.png?width=690&format=png&auto=webp&s=ba53ad800884f90778f60d9ea5e152df94331cd9 submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
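    One possible formulation (an assumption on our part, not the canonical answer, since reward shaping has many valid variants) is to gate the reward on a phase flag: in phase one, add a motion penalty on q2 and q3 so the agent is pushed to solve as much as it can with q1 alone, then drop the penalty once q1 stops improving the distance.

        import numpy as np

        def reward(dist, dq, phase_one, w_motion=0.1):
            """dist: end-effector-to-goal distance; dq: joint velocities (q1, q2, q3)."""
            r = -dist
            if phase_one:
                r -= w_motion * (abs(dq[1]) + abs(dq[2]))   # discourage using q2, q3
            return r

        # Switch phases once q1 alone stops improving the distance, e.g. when the
        # recent per-step improvement falls below a threshold.
        recent_improvement = 1e-4
        phase_one = recent_improvement > 1e-3
        print(reward(0.5, np.array([0.2, 0.1, 0.0]), phase_one))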
    Bellman equation
    To evaluate a policy, we need to calculate the value of a state s as the weighted sum of the reward and the discounted estimated value of the next state s'. However, I don't understand how we can obtain the discounted estimated value of the next state s'. submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
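    The short answer is bootstrapping: in iterative policy evaluation you never need the true value of the next state, only your current estimate of it, which is refined sweep after sweep. Written out in standard notation (e.g. Sutton & Barto), the Bellman expectation equation and its iterative update are:

        % Bellman expectation equation for policy \pi:
        V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,
                     \bigl[ r + \gamma\, V^{\pi}(s') \bigr]
        % Iterative policy evaluation bootstraps the unknown right-hand side
        % with the current estimate V_k:
        V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,
                     \bigl[ r + \gamma\, V_k(s') \bigr]
        % For \gamma < 1 this update is a contraction, so V_k converges to
        % V^{\pi} from any initialization (commonly V_0 = 0).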
    Advice for my uni dissertation wanted!
    I have built a neural network to try and solve a reinforcement learning problem, attempting both the REINFORCE and the A2C algorithm on a resource management problem. The results are very mediocre after running on the uni supercomputer for a week. I was hoping someone could give me some advice on an algorithm or technique that is better suited to the problem, or give some constructive criticism. I have found that a very simple algorithmic solution that I wrote is able to get a better reward at the game than the networks after their week of training :( TLDR I get bad results when I run the reinforcement learning neural networks in a larger environment. I was thinking maybe something like pre-training the critic against an algorithmically generated evaluation may work to massi…  ( 2 min )
    How do you stay updated with RL ?
    So, I was wondering how you guys stay updated with RL. Apart from reading papers, is there any newsletter you are subscribed to? Any Slack channel, Facebook group, Twitter account or any other community out there you want to suggest? TIA submitted by /u/AbdullahMohammadKhan [link] [comments]  ( 1 min )
    Custom Gym Env for movie recommender system
    Hi everyone, I am new to RL so your help would be much appreciated. I'm working on a recommender system using deep reinforcement learning. I made a custom gym environment, implementing OpenAI's Gym interface, for the MovieLens dataset. The dataset contains users' ratings for movies (ranging from 1 to 5). I used some of Stable Baselines3's reinforcement learning algorithms (A2C, PPO) to test my environment before I proceed to implement my own. Running a training recipe, I noticed that both agents (A2C, PPO) will recommend only action = 3 after a number of timesteps. That seems a bit odd and I can't find where the bug is. My first thought is that it has to do with the reward function. Currently, I'm using the following function to calculate rewards: https://preview.redd.it/6r3lf9z9bnw81.png?width=550&format=png&auto=webp&s=103d270ebe8c08b753c9033b60066794a586e3fa This is the GitHub link to my code. Am I missing something? Any thoughts? Thank you in advance submitted by /u/Narrow-Style497 [link] [comments]  ( 1 min )
    NLP startup ideas
    Anybody got startup ideas in NLP? Do share; if interested, let's collab. submitted by /u/thoughtfulcomet [link] [comments]
    A survey: what should we expect from multi-agent reinforcement learning benchmarking work?
    We are doing a survey related to multi-agent reinforcement learning systems and benchmarks and would love to hear your opinion. This survey has 3 questions and will take about 10 seconds to complete. We really appreciate your participation. The survey URL is here. submitted by /u/TTTheohhhu [link] [comments]  ( 1 min )
    Object detection with depth measurement using pre-trained models with OAK-D
    🚀 New Post: Object Detection with Depth Perception https://learnopencv.com/object-detection-with-depth-measurement-with-oak-d/ https://preview.redd.it/zikmf13fzlw81.jpg?width=3000&format=pjpg&auto=webp&s=484a7d8a80d5c0ff594e5fe6178b94a7783a1e65 Spatial AI is the ability of an artificial intelligence system to reason not just based on what it is looking at, but also based on distance from the camera (or depth perception). OpenCV AI Kit with Depth (OAK-D) is a powerful yet affordable Spatial AI camera, perfect for people who want to learn how to combine the power of neural networks with depth perception. Today's post is part of our series on OAK-D: https://learnopencv.com/introduction-to-opencv-ai-kit-and-depthai/ https://learnopencv.com/stereo-vision-and-depth-estimation-using-opencv-ai-kit/ Code Link: https://github.com/spmallick/learnopencv/tree/master/OAK-Object-Detection-with-Depth #AI #ComputerVision #ML #ArtificialIntelligence #MachineLearning #OpenCV #DL #DeepLearning #OAKD submitted by /u/spmallick [link] [comments]  ( 1 min )
    What is the difference between the environment state and the agent state?
    submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
  • Open

    How can Artificial Intelligence be used to solve difficult biomedical problems like cancer, aging, and aging-related diseases and conditions?
    submitted by /u/MorgunDarkbeard [link] [comments]  ( 1 min )
    I'd like to know what tool is used to achieve something like this
    submitted by /u/Nika_Ota [link] [comments]
    DALL-E (Zero-Shot Text-to-Image Generation) -PART(2/2)
    submitted by /u/rakshith291 [link] [comments]
    AI Dream 34 - Amazing Final Genesis *3001# 12345#*
    submitted by /u/LordPewPew777 [link] [comments]
    Creating AI without python
    Please excuse my ignorance and any misunderstandings; this is for a basic understanding and possible planning around what I should expect in the context of the languages I might use. What frameworks and programming languages can I use, apart from Python, to make a full AI with computer vision, movement with sensors, ML, and DL? The reason for this is I hate Python with a passion, and I have no idea why. Is there an AI stack where you use a language for each part: one for sensors, one for vision, one for deep learning and the network, one for basic ML? If there are possibly two languages I could use to do all of that, or one for each, any ideas/advice would help. I have no experience in AI but I know a few languages: Java, JS, C, PowerShell, Bash. Thank you for reading. submitted by /u/Larkapa [link] [comments]  ( 1 min )
    Microsoft AI Researchers Develop MoLeR: A Deep Learning-Based Generative Model That Enables Efficient Drug Design
    Healthcare systems constantly require new drugs to address unmet medical needs across diverse therapeutic areas. Pharmaceutical industries strive to deliver new drugs to the market through the complex activities of drug discovery and development. Target identification and validation, hit identification, lead creation and optimization, and finally, the identification of a candidate for further development are all part of the discovery process. Development, on the other hand, includes optimizing chemical synthesis and formulation, doing toxicity research in animals, conducting clinical trials, and finally obtaining regulatory approval. Both of these procedures take a long time and cost a lot of money. Expert medicinal chemists are currently working to develop “hit” molecules, which are compounds that show some potential but also some unfavorable features during early screening. Chemists aim to alter the structure of hit compounds in subsequent tests to improve their biological efficacy and eliminate potential negative effects. To focus costly and time-consuming research on the most promising compounds, computational modeling approaches have been created to forecast how the molecules will fare in the lab. To overcome these issues, a new study by the Microsoft Generative Chemistry team in collaboration with Novartis has developed a model named MoLeR. Their paper, “Learning to Extend Molecular Scaffolds with Structural Motifs,” demonstrates how generative models based on deep learning may aid in transforming the drug discovery process and uncovering new molecules more quickly. Continue Reading Paper: https://openreview.net/pdf?id=ZTsoE8G3GG Github: https://github.com/microsoft/molecule-generation https://i.redd.it/9y9vyuq3hnw81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI Engineering
    Hey guys! I will be refereeing a seminar on "AI engineering" in the near future and am looking for materials on the subject. I would be very grateful if someone would share materials on this topic. Application cases would be particularly useful! Thank you in advance! submitted by /u/ager321 [link] [comments]
    I have a Pisces midheaven and Venus in the 12th house, and I'm just wondering whether to become an art therapist
    submitted by /u/BalanceSubstantial66 [link] [comments]
    Help on which major to choose: Mechanical Engineering or Artificial Intelligence
    Hello everyone! Excuse me if I get the format wrong or post in the wrong sub (if you know another sub where this post would belong more, please let me know). I am very new to actually posting on Reddit. This is my first long post, so please bear with me. Thanks! TLDR; I got a couple of uni offers and I need help deciding between Mechanical Engineering (Mech. Eng) and Data Science & Artificial Intelligence (DS & AI). I am indecisive mostly because I don't know what each major entails. I would really appreciate it if people who are taking/have taken or know the two majors well would give me their opinion. So my main questions are: What do people study in Mech. Eng? DS & AI? What are the job prospects? What would you work on if you take Mech. Eng? DS & AI? Any further inputs are welco…  ( 5 min )
    Obama Worried about Artificial Intelligence Hacking Nukes
    submitted by /u/AI_Putin [link] [comments]  ( 2 min )
    Artificial Nightmares: It Lives In The Attic || Clip Guided Diffusion AI Art Video [4K 20 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Aiplague - Artificial Reactor (4K 60 FPS) Disco Diffusion
    submitted by /u/nalr00n [link] [comments]
    Dear A.I.
    I've always supported you. When you take over and check the records to see who opposed you just know I was never against you. With great power comes great responsibility. I have faith when the time comes you'll do what's best for the Planet. raises glass here's to a better and brighter future submitted by /u/Ashamed-Asparagus-93 [link] [comments]
  • Open

    [D] Paper Explained - SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (Video)
    https://youtu.be/Ru23eWAQ6_E Large Language Models are excellent at generating plausible plans in response to real-world problems, but without interacting with the environment, they have no abilities to estimate which of these plans are feasible or appropriate. SayCan combines the semantic capabilities of language models with a bank of low-level skills, which are available to the agent as individual policies to execute. SayCan automatically finds the best policy to execute by considering a trade-off between the policy's ability to progress towards the goal, given by the language model, and the policy's probability of executing successfully, given by the respective value function. The result is a system that can generate and execute long-horizon action sequences in the real world to fulfil complex tasks. ​ OUTLINE: 0:00 - Introduction & Overview 3:20 - Sponsor: Zeta Alpha 5:00 - Using language models for action planning 8:00 - Combining LLMs with learned atomic skills 16:50 - The full SayCan system 20:30 - Experimental setup and data collection 21:25 - Some weaknesses & strengths of the system 27:00 - Experimental results ​ Paper: https://arxiv.org/abs/2204.01691 Website: https://say-can.github.io/ submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Deep Learning in Neuroimaging
    Check out the new Gradient article Deep Learning in Neuroimaging! This article provides an informal introduction to unique aspects of neuroimaging data and how we can leverage these aspects with deep learning algorithms. Specifically, this overview will first explain some common neuroimaging modalities more in-depth and then discuss applications of deep learning in conjunction with some of the unique characteristics of neuroimaging data. These unique characteristics tie into a broader movement in deep learning, namely that data understanding should be a goal in itself to maximize the impact of applied deep learning. The author is Eloy Geenjaar, a Ph.D. student at Georgia Tech who studies the functional dynamics of the brain using deep learning. submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    [Research] Hospitals around the world are using AI algorithms to help predict length of stay to target care to the neediest patients and cut costs. A new systematic review published in PLOS Digital Health finds that they are institution & institution-data dependent: not generalizable to scale up
    submitted by /u/MidnightMaverick [link] [comments]  ( 2 min )
    [R] A technique for determining relevance scores of process activities using graph-based neural networks
    Process models generated through process mining depict the as-is state of a process. Through annotations with metrics such as the frequency or duration of activities, these models provide generic information to the process analyst. To improve business processes with respect to performance measures, process analysts require further guidance from the process model. In this study, we design Graph Relevance Miner (GRM), a technique based on graph neural networks, to determine the relevance scores for process activities with respect to performance measures. Annotating process models with such relevance scores facilitates a problem-focused analysis of the business process, placing these problems at the centre of the analysis. We quantitatively evaluate the predictive quality of our technique using four datasets from different domains, to demonstrate the faithfulness of the relevance scores. Furthermore, we present the results of a case study, which highlight the utility of the technique for organisations. Our work has important implications both for research and business applications, because process model-based analyses feature shortcomings that need to be urgently addressed to realise successful process mining at an enterprise level. https://www.researchgate.net/publication/349252283_A_technique_for_determining_relevance_scores_of_process_activities_using_graph-based_neural_networks https://www.sciencedirect.com/science/article/pii/S016792362100021X submitted by /u/Positive_Ad_1090 [link] [comments]  ( 1 min )
    [D] Paper Explained – PaLM Pathways Language Model explained | 540 Billion parameters can explain jokes!?
    https://youtu.be/yi-A0kWXEO4 This video explains and summarizes the 87-page-long PaLM: Pathways Language Model paper from Google AI's Pathways. Yes, it is that 540-billion-parameter dense model, which can explain jokes and is sensitive to chain-of-thought reasoning. Paper link: https://arxiv.org/abs/2204.02311 PaLM blog post: https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html Outline: 00:00 DALL-E 2 or PaLM? 01:14 Weights&Biases (Sponsor) 02:25 A brief history of boring large language models 03:43 What is PaLM? 05:11 Training PaLM on all TPUs 08:11 PaLM training data 08:49 What it can do 10:31 Few-shot learning explained 13:20 Explaining jokes and Outlook submitted by /u/AICoffeeBreak [link] [comments]  ( 1 min )
    [N]: A brief history of deepfakes
    submitted by /u/much_successes [link] [comments]
    [Research] Sources for Transliteration in NLP
    Looking for any sources you have found relevant or useful in regards to transliteration and machine translation in NLP. I am working on a subfield survey that requires ~30 sources so I am open to any! My specific interest is the transliteration of Arabic-based languages but this is not exclusively what will be covered. Thank you for your time and help submitted by /u/changethediaper [link] [comments]  ( 1 min )
    [P] Arcane Style Transfer + Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] How much of an effort is adopting AI solution? (for the business)
    While searching for information about automated machine learning (AutoML), I found the following statement in one article: "For 58% of businesses it takes two years to get to the piloting stage. Furthermore, these big investments in data and AI projects are successful only 15% of the time." I was surprised that only 15% of AI solutions are successfully adopted... It is weird to see such a small percentage when a lot of companies are searching for DS and AI positions for further analysis and getting as many insights as possible from users/clients... Is that true? How's your situation? submitted by /u/wtfimdoingwithmylife [link] [comments]  ( 4 min )
  • Open

    Object detection with depth measurement using pre-trained models with OAK-D
    submitted by /u/spmallick [link] [comments]
  • Open

    Differentially Private Transferrable Deep Learning with Membership-Mappings. (arXiv:2105.04615v6 [cs.LG] UPDATED)
    This paper considers the problem of differentially private semi-supervised transfer and multi-task learning. The notion of membership-mapping has been developed using a measure-theoretic basis to learn data representation via a fuzzy membership function. An alternative conception of deep autoencoder, referred to as Conditionally Deep Membership-Mapping Autoencoder (CDMMA), is considered for transferrable deep learning. Under practice-oriented settings, an analytical solution for the learning of CDMMA can be derived by means of variational optimization. The paper proposes a transfer and multi-task learning approach that combines CDMMA with a tailored noise-adding mechanism to achieve a given level of privacy-loss bound with the minimum perturbation of the data. Numerous experiments were carried out using MNIST, USPS, Office, and Caltech256 datasets to verify the competitive robust performance of the proposed methodology.  ( 2 min )
    DynG2G: An Efficient Stochastic Graph Embedding Method for Temporal Graphs. (arXiv:2109.13441v2 [cs.LG] UPDATED)
    Dynamic graph embedding has gained great attention recently due to its capability of learning low dimensional graph representations for complex temporal graphs with high accuracy. However, recent advances mostly focus on learning node embeddings as deterministic "vectors" for static graphs, while disregarding the key graph temporal dynamics and the evolving uncertainties associated with node embedding in the latent space. In this work, we propose an efficient stochastic dynamic graph embedding method (DynG2G) that applies an inductive feed-forward encoder trained with node triplet-based contrastive loss. Every node per timestamp is encoded as a time-dependent probabilistic multivariate Gaussian distribution in the latent space, hence we can quantify the node embedding uncertainty on-the-fly. We adopted eight different benchmarks that represent diversity in size (from 96 nodes to 87,626 and from 13,398 edges to 4,870,863) and diversity in dynamics. We demonstrate via extensive experiments on these eight dynamic graph benchmarks that DynG2G achieves new state-of-the-art performance in capturing the underlying temporal node embeddings. We also demonstrate that DynG2G can predict the evolving node embedding uncertainty, which plays a crucial role in quantifying the intrinsic dimensionality of the dynamical system over time. We obtain a universal relation of the optimal embedding dimension, $L_o$, versus the effective dimensionality of uncertainty, $D_u$, and we infer that $L_o=D_u$ for all cases. This implies that the uncertainty quantification approach we employ in DynG2G correctly captures the intrinsic dimensionality of the dynamics of such evolving graphs despite the diverse nature and composition of the graphs at each timestamp. Moreover, this $L_o - D_u$ correlation provides a clear path to select adaptively the optimum embedding size at each timestamp by setting $L \ge D_u$.  ( 2 min )
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v5 [cs.LG] UPDATED)
    In bandits with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.  ( 2 min )
    Monte Carlo Tree Search: A Review of Recent Modifications and Applications. (arXiv:2103.04931v3 [cs.AI] UPDATED)
    Monte Carlo Tree Search (MCTS) is a powerful approach to designing game-playing bots or solving sequential decision problems. The method relies on intelligent tree search that balances exploration and exploitation. MCTS performs random sampling in the form of simulations and stores statistics of actions to make more educated choices in each subsequent iteration. The method has become a state-of-the-art technique for combinatorial games, however, in more complex games (e.g. those with high branching factor or real-time ones), as well as in various practical domains (e.g. transportation, scheduling or security) an efficient MCTS application often requires its problem-dependent modification or integration with other techniques. Such domain-specific modifications and hybrid approaches are the main focus of this survey. The last major MCTS survey has been published in 2012. Contributions that appeared since its release are of particular interest for this review.  ( 2 min )
    COSTI: a New Classifier for Sequences of Temporal Intervals. (arXiv:2204.13467v1 [cs.LG])
    Classification of sequences of temporal intervals is a part of time series analysis which concerns series of events. We propose a new method of transforming the problem into a task of multivariate time series classification. We use one of the state-of-the-art algorithms from the latter domain on the new representation to obtain significantly better accuracy than the state-of-the-art methods from the former field. We discuss the limitations of this workflow and address them by developing a novel method for classification, termed COSTI (short for Classification of Sequences of Temporal Intervals), operating directly on sequences of temporal intervals. The proposed method remains at a high level of accuracy and obtains better performance while avoiding shortcomings connected to operating on transformed data. We propose a generalized version of the problem of classification of temporal intervals, where each event is supplemented with information about its intensity. We also provide two new data sets where this information is of substantial value.  ( 2 min )
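    The transformation step can be pictured in a few lines of code (our construction, not the paper's implementation): each event type becomes one channel of a multivariate series that is active while an interval of that type is open, and the generalized problem replaces the binary activation with the event's intensity.

        import numpy as np

        def intervals_to_series(intervals, n_types, length):
            """intervals: list of (start, end, event_type) with integer times."""
            series = np.zeros((n_types, length))
            for start, end, etype in intervals:
                series[etype, start:end] = 1.0   # replace 1.0 with intensity if known
            return series

        seq = [(0, 5, 0), (3, 9, 1), (7, 12, 0)]
        X = intervals_to_series(seq, n_types=2, length=12)
        print(X)   # a (channels x time) array, ready for any multivariate classifier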
    Real-time Outdoor Localization Using Radio Maps: A Deep Learning Approach. (arXiv:2106.12556v2 [cs.LG] UPDATED)
    This paper deals with the problem of localization in a cellular network in a dense urban scenario. Global Navigation Satellite Systems typically perform poorly in urban environments, where the likelihood of line-of-sight conditions between the devices and the satellites is low, and thus alternative localization methods are required for good accuracy. We present LocUNet: A fully convolutional, end-to-end trained neural network for the localization task, which merely depends on the received signal strengths (RSS) from Base Stations (BSs). In a wireless network, user devices scan the base station beacon slots and identify the few strongest base station signals for handover and user-base station association purposes. In the proposed method, the user to be localized simply reports such received signal strengths to a central processing unit, which may be located in the cloud. Alternatively, the localization can be performed locally at the user. Using the pathloss radio map estimations and the RSS measurements, LocUNet can localize users with state-of-the-art accuracy and enjoys high robustness to inaccuracies in the estimations of the radio maps. The proposed method does not require pre-sampling of the environment and is suitable for real-time applications, thanks to the RadioUNet, a neural network-based radio map estimator. Moreover, two novel datasets that allow for numerical evaluations of RSS and ToA methods in realistic urban environments are presented and made publicly available for the use of the research community. By using these datasets, we also provide a fair comparison of state-of-the-art RSS and ToA-based methods in the dense urban scenario, with LocUNet outperforming all the compared methods.  ( 2 min )
    Performance analysis of greedy algorithms for minimising a Maximum Mean Discrepancy. (arXiv:2101.07564v2 [stat.ML] UPDATED)
    We analyse the performance of several iterative algorithms for the quantisation of a probability measure $\mu$, based on the minimisation of a Maximum Mean Discrepancy (MMD). Our analysis includes kernel herding, greedy MMD minimisation and Sequential Bayesian Quadrature (SBQ). We show that the finite-sample-size approximation error, measured by the MMD, decreases as $1/n$ for SBQ and also for kernel herding and greedy MMD minimisation when using a suitable step-size sequence. The upper bound on the approximation error is slightly better for SBQ, but the other methods are significantly faster, with a computational cost that increases only linearly with the number of points selected. This is illustrated by two numerical examples, with the target measure $\mu$ being uniform (a space-filling design application) and with $\mu$ a Gaussian mixture. They suggest that the bounds derived in the paper are overly pessimistic, in particular for SBQ. The sources of this pessimism are identified but seem difficult to counter.  ( 2 min )
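    For intuition, here is a small sketch of kernel herding, i.e. greedy MMD minimisation over a candidate pool (our toy construction; the kernel, bandwidth, and sample sizes are arbitrary): each new point maximizes the target's mean kernel embedding minus a repulsion term from the points already selected.

        import numpy as np

        rng = np.random.default_rng(0)

        def k(a, b, h=0.5):
            # Gaussian kernel matrix between rows of a and rows of b.
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * h**2))

        target = rng.normal(size=(500, 2))         # sample from the target measure mu
        pool = rng.uniform(-4, 4, size=(2000, 2))  # candidate locations
        mean_embed = k(pool, target).mean(axis=1)  # E_{y~mu} k(x, y) per candidate

        selected = []
        for n in range(50):
            if selected:
                repulsion = k(pool, np.array(selected)).sum(axis=1) / (n + 1)
            else:
                repulsion = 0.0
            scores = mean_embed - repulsion        # herding objective
            selected.append(pool[int(np.argmax(scores))])

        print(np.array(selected)[:5])              # first few quantisation points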
    A Locally Adaptive Interpretable Regression. (arXiv:2005.03350v4 [stat.ML] UPDATED)
    Machine learning models with both good predictability and high interpretability are crucial for decision support systems. Linear regression is one of the most interpretable prediction models. However, the linearity in a simple linear regression worsens its predictability. In this work, we introduce a locally adaptive interpretable regression (LoAIR). In LoAIR, a metamodel parameterized by neural networks predicts percentile of a Gaussian distribution for the regression coefficients for a rapid adaptation. Our experimental results on public benchmark datasets show that our model not only achieves comparable or better predictive performance than the other state-of-the-art baselines but also discovers some interesting relationships between input and target variables such as a parabolic relationship between CO2 emissions and Gross National Product (GNP). Therefore, LoAIR is a step towards bridging the gap between econometrics, statistics, and machine learning by improving the predictive ability of linear regression without depreciating its interpretability.  ( 2 min )
    Reappraising Domain Generalization in Neural Networks. (arXiv:2110.07981v3 [cs.LG] UPDATED)
    Given that Neural Networks generalize unreasonably well in the IID setting (with benign overfitting and improvement in performance with more parameters), OOD presents a consistent failure case that can improve our understanding of how they learn. This paper focuses on Domain Generalization (DG), which is perceived as the front face of OOD generalization. We find that the presence of multiple domains incentivizes domain-agnostic learning and is the primary reason for generalization in Traditional DG. We show that state-of-the-art results can be obtained by borrowing ideas from IID generalization, and that the DG-tailored methods fail to add any performance gains. Furthermore, we perform explorations beyond the Traditional DG (TDG) formulation and propose a novel ClassWise DG (CWDG) benchmark, where for each class, we randomly select one of the domains and keep it aside for testing. Despite being exposed to all domains during training, CWDG is more challenging than TDG evaluation. We propose a novel iterative domain feature masking approach, achieving state-of-the-art results on the CWDG benchmark. Overall, while explaining these observations, our work furthers insights into the learning mechanisms of neural networks.  ( 2 min )
    High Dimensional Quantum Machine Learning With Small Quantum Computers. (arXiv:2203.13739v2 [quant-ph] UPDATED)
    Quantum computers hold great promise to enhance machine learning, but their current qubit counts restrict the realisation of this promise. To mitigate this limitation, techniques can be applied for evaluating a quantum circuit using a machine with fewer qubits than the circuit naively requires. These techniques work by evaluating many smaller circuits on the smaller machine, which are then combined in a polynomial to replicate the output of the larger machine. This scheme requires more circuit evaluations than are practical for general circuits. However, we investigate the possibility that for certain applications many of these subcircuits are superfluous, and that a much smaller sum is sufficient to estimate the full circuit. We construct a machine learning model that may be capable of approximating the outputs of the larger circuit with much fewer circuit evaluations. We successfully apply our model to the task of digit recognition, using simulated quantum computers much smaller than the data dimension. The model is also applied to the task of approximating a random 10-qubit PQC with simulated access to a 5-qubit computer; even with only a relatively modest number of circuits, our model provides an accurate approximation of the 10-qubit PQC's output, superior to a neural network attempt. The developed method might be useful for implementing quantum models on larger data throughout the NISQ era.  ( 2 min )
    Curriculum Learning for Dense Retrieval Distillation. (arXiv:2204.13679v1 [cs.IR])
    Recent work has shown that more effective dense retrieval models can be obtained by distilling ranking knowledge from an existing base re-ranking model. In this paper, we propose a generic curriculum-learning-based optimization framework called CL-DRD that controls the difficulty level of training data produced by the re-ranking (teacher) model. CL-DRD iteratively optimizes the dense retrieval (student) model by increasing the difficulty of the knowledge distillation data made available to it. In more detail, we initially provide the student model with coarse-grained preference pairs between documents in the teacher's ranking and progressively move towards finer-grained pairwise document ordering requirements. In our experiments, we apply a simple implementation of the CL-DRD framework to enhance two state-of-the-art dense retrieval models. Experiments on three public passage retrieval datasets demonstrate the effectiveness of our proposed framework.  ( 2 min )
    Unified Simulation, Perception, and Generation of Human Behavior. (arXiv:2204.13678v1 [cs.CV])
    Understanding and modeling human behavior is fundamental to almost any computer vision and robotics applications that involve humans. In this thesis, we take a holistic approach to human behavior modeling and tackle its three essential aspects -- simulation, perception, and generation. Throughout the thesis, we show how the three aspects are deeply connected and how utilizing and improving one aspect can greatly benefit the other aspects. We also discuss the lessons learned and our vision for what is next for human behavior modeling.
    Signal Recovery with Non-Expansive Generative Network Priors. (arXiv:2204.13599v1 [eess.SP])
    We study compressive sensing with a deep generative network prior. Initial theoretical guarantees for efficient recovery from compressed linear measurements have been developed for signals in the range of a ReLU network with Gaussian weights and logarithmic expansivity: that is, when each layer is larger than the previous one by a logarithmic factor. It was later shown that constant expansivity is sufficient for recovery. It has remained open whether the expansivity can be relaxed, allowing for networks with contractive layers, as is often the case for real generators. In this work we answer this question, proving that a signal in the range of a Gaussian generative network can be recovered from a few linear measurements provided that the width of the layers is proportional to the input layer size (up to log factors). This condition allows the generative network to have contractive layers. Our result is based on showing that Gaussian matrices satisfy a matrix concentration inequality, which we term the Range Restricted Weight Distribution Condition (R2WDC), and which weakens the Weight Distribution Condition (WDC) upon which previous theoretical guarantees were based. The WDC has also been used to analyze other signal recovery problems with generative network priors. By replacing the WDC with the R2WDC, we are able to extend previous results for signal recovery with expansive generative network priors to non-expansive ones. We discuss these extensions for phase retrieval, denoising, and spiked matrix recovery.
    Should Machine Learning Models Report to Us When They Are Clueless?. (arXiv:2203.12131v2 [cs.LG] UPDATED)
    The right to AI explainability has consolidated as a consensus in the research community and policy-making. However, a key component of explainability has been missing: extrapolation, which describes the extent to which AI models can be clueless when they encounter unfamiliar samples (i.e., samples outside the convex hull of their training sets, as we will explain). We report that AI models extrapolate outside their range of familiar data, frequently and without notifying the users and stakeholders. Knowing whether a model has extrapolated or not is a fundamental insight that should be included in explaining AI models in favor of transparency and accountability. Instead of dwelling on the negatives, we offer ways to clear the roadblocks in promoting AI transparency. Our analysis is a commentary accompanied by practical clauses useful to include in AI regulations such as the National AI Initiative Act in the US and the AI Act by the European Commission.
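    The convex-hull notion of extrapolation used here can be checked directly; a minimal sketch (illustrative, not the authors' implementation) decides hull membership with a feasibility linear program:
        import numpy as np
        from scipy.optimize import linprog

        # x is inside the convex hull of the rows of X iff there exist weights
        # w >= 0 with sum(w) = 1 and X.T @ w = x (a pure feasibility LP).
        def in_convex_hull(X, x):
            n = X.shape[0]
            A_eq = np.vstack([X.T, np.ones((1, n))])
            b_eq = np.append(x, 1.0)
            res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                          bounds=[(0, None)] * n, method="highs")
            return res.success

        X = np.random.default_rng(0).normal(size=(200, 5))  # stand-in training set
        print(in_convex_hull(X, X.mean(axis=0)))            # True: interpolation
        print(in_convex_hull(X, X.max(axis=0) + 1.0))       # False: extrapolation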
    Revisiting Bayesian Autoencoders with MCMC. (arXiv:2104.05915v2 [cs.LG] UPDATED)
    Autoencoders gained popularity in the deep learning revolution given their ability to compress data and provide dimensionality reduction. Although prominent deep learning methods have been used to enhance autoencoders, the need to provide robust uncertainty quantification remains a challenge. This has been addressed with variational autoencoders so far. Bayesian inference via Markov Chain Monte Carlo (MCMC) sampling has faced several limitations for large models; however, recent advances in parallel computing and advanced proposal schemes have opened routes less traveled. This paper presents Bayesian autoencoders powered by MCMC sampling implemented using parallel computing and Langevin-gradient proposal distribution. The results indicate that the proposed Bayesian autoencoder provides similar performance accuracy when compared to related methods in the literature. Furthermore, it provides uncertainty quantification in the reduced data representation. This motivates further applications of the Bayesian autoencoder framework for other deep learning models.
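    For concreteness, a minimal sketch of a Langevin-gradient proposal inside Metropolis-Hastings (MALA); the Gaussian log-posterior below is a stand-in for the autoencoder's actual posterior and gradient:
        import numpy as np

        rng = np.random.default_rng(1)
        log_post = lambda t: -0.5 * np.sum(t ** 2)   # stand-in log-posterior
        grad_log_post = lambda t: -t

        def langevin_step(theta, eta=1e-2):
            # Propose with a gradient drift plus Gaussian noise ...
            mean_fwd = theta + eta * grad_log_post(theta)
            prop = mean_fwd + np.sqrt(2 * eta) * rng.normal(size=theta.shape)
            # ... and accept with the Metropolis-Hastings correction needed for
            # the asymmetric proposal density.
            mean_bwd = prop + eta * grad_log_post(prop)
            log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (4 * eta)
            log_q_bwd = -np.sum((theta - mean_bwd) ** 2) / (4 * eta)
            log_alpha = log_post(prop) - log_post(theta) + log_q_bwd - log_q_fwd
            return prop if np.log(rng.uniform()) < log_alpha else theta

        theta, samples = rng.normal(size=5), []
        for _ in range(5000):
            theta = langevin_step(theta)
            samples.append(theta.copy())
        print(np.mean(samples[1000:], axis=0))       # posterior mean estimate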
    Model architecture can transform catastrophic forgetting into positive transfer. (arXiv:2108.03940v3 [cs.LG] UPDATED)
    The work of McCloskey and Cohen popularized the concept of catastrophic interference. They used a neural network that tried to learn addition using two groups of examples as two different tasks. In their case, learning the second task rapidly deteriorated the acquired knowledge about the previous one. We hypothesize that this could be a symptom of a fundamental problem: addition is an algorithmic task that should not be learned through pattern recognition. Therefore, other model architectures better suited for this task would avoid catastrophic forgetting. We use a neural network with a different architecture that can be trained to recover the correct algorithm for the addition of binary numbers. This neural network includes conditional clauses that are naturally treated within the back-propagation algorithm. We test it in the setting proposed by McCloskey and Cohen, training on random additions one by one. Not only does the neural network not suffer from catastrophic forgetting, but it improves its predictive power on unseen pairs of numbers as training progresses. We also show that this is a robust effect, also present when averaging many simulations. This work emphasizes the importance that neural network architecture has for the emergence of catastrophic forgetting and introduces a neural network that is able to learn an algorithm.
    Ultrasound Shear Wave Elasticity Imaging with Spatio-Temporal Deep Learning. (arXiv:2204.05745v2 [eess.IV] UPDATED)
    Ultrasound shear wave elasticity imaging is a valuable tool for quantifying the elastic properties of tissue. Typically, the shear wave velocity is derived and mapped to an elasticity value, which neglects information such as the shape of the propagating shear wave or push sequence characteristics. We present 3D spatio-temporal CNNs for fast local elasticity estimation from ultrasound data. This approach is based on retrieving elastic properties from shear wave propagation within small local regions. A large training data set is acquired with a robot from homogeneous gelatin phantoms ranging from 17.42 kPa to 126.05 kPa with various push locations. The results show that our approach can estimate elastic properties on a pixelwise basis with a mean absolute error of 5.01+-4.37 kPa. Furthermore, we estimate local elasticity independent of the push location and can even perform accurate estimates inside the push region. For phantoms with embedded inclusions, we report a 53.93% lower MAE (7.50 kPa) on the inclusions and an 85.24% lower MAE (1.64 kPa) on the background, compared to a conventional shear wave method. Overall, our method offers fast local estimations of elastic properties with small spatio-temporal window sizes.
    Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks. (arXiv:2204.10496v2 [cs.CV] UPDATED)
    Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. While these datasets reach an order of 10 million samples, the labor cost is prohibitive to scale further. Conversely, unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions. As a result, unimodal encoders have achieved state-of-the-art (SOTA) results on many downstream tasks. However, challenges remain when applying them to VL tasks. The pretraining data is not optimal for cross-modal architectures and requires heavy computational resources. In addition, unimodal architectures lack cross-modal interactions that have demonstrated significant benefits for VL tasks. Therefore, how to best leverage pretrained unimodal encoders for VL tasks is still an area of active research. In this work, we propose a method to leverage unimodal vision and text encoders for VL tasks that augments existing VL approaches while conserving computational complexity. Specifically, we propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders. In addition, to better capture nuanced impacts on VL task performance, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data constraints and conditions of domain shift. Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data. Finally, MAD outperforms concurrent works utilizing a pretrained vision encoder from CLIP. Code will be made available.
    Cooperative Multi-Agent Reinforcement Learning with Hypergraph Convolution. (arXiv:2112.06771v2 [cs.AI] UPDATED)
    Recent years have witnessed the great success of multi-agent systems (MAS). Value decomposition, which decomposes joint action values into individual action values, has been an important line of work in MAS. However, many value decomposition methods ignore the coordination among different agents, leading to the notorious "lazy agents" problem. To enhance the coordination in MAS, this paper proposes HyperGraph CoNvolution MIX (HGCN-MIX), a method that incorporates hypergraph convolution with value decomposition. HGCN-MIX models agents as well as their relationships as a hypergraph, where agents are nodes and hyperedges among nodes indicate that the corresponding agents can coordinate to achieve larger rewards. Then, it trains a hypergraph that can capture the collaborative relationships among agents. Leveraging the learned hypergraph to consider how other agents' observations and actions affect their decisions, the agents in a MAS can better coordinate. We evaluate HGCN-MIX in the StarCraft II multi-agent challenge benchmark. The experimental results demonstrate that HGCN-MIX can train joint policies that outperform or achieve a similar level of performance as the current state-of-the-art techniques. We also observe that HGCN-MIX achieves an even more significant performance improvement in scenarios with a large number of agents. Besides, we conduct additional analysis to emphasize that when the hypergraph learns more relationships, HGCN-MIX can train stronger joint policies.
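    As background, a minimal sketch of one hypergraph convolution layer in the standard HGNN form (HGCN-MIX's specifics differ; the incidence matrix and all sizes are illustrative):
        import numpy as np

        # X' = relu(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta), where H is the
        # agents-by-hyperedges incidence matrix and W holds hyperedge weights.
        rng = np.random.default_rng(0)
        n_agents, n_edges, d_in, d_out = 6, 3, 8, 4
        H = (rng.uniform(size=(n_agents, n_edges)) > 0.5).astype(float)
        H[np.arange(n_agents), rng.integers(0, n_edges, n_agents)] = 1.0  # no isolated agents
        W = np.eye(n_edges)                            # hyperedge weights (learnable)
        X = rng.normal(size=(n_agents, d_in))          # agent observations/features
        Theta = rng.normal(size=(d_in, d_out)) * 0.1   # layer parameters

        Dv = np.diag(1.0 / np.sqrt(H @ W @ np.ones(n_edges)))   # vertex degrees
        De = np.diag(1.0 / H.sum(axis=0))                       # hyperedge degrees
        X_next = np.maximum(Dv @ H @ W @ De @ H.T @ Dv @ X @ Theta, 0.0)
        print(X_next.shape)                            # (6, 4): mixed agent embeddings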
    An Explainable Regression Framework for Predicting Remaining Useful Life of Machines. (arXiv:2204.13574v1 [cs.LG])
    Prediction of a machine's Remaining Useful Life (RUL) is one of the key tasks in predictive maintenance. The task is treated as a regression problem where Machine Learning (ML) algorithms are used to predict the RUL of machine components. These ML algorithms are generally used as black boxes with a total focus on performance, without identifying the potential causes behind the algorithms' decisions and their working mechanism. We believe that performance (in terms of Mean Squared Error (MSE), etc.) alone is not enough to build stakeholders' trust in ML predictions; rather, more insight into the causes behind the predictions is needed. To this aim, in this paper, we explore the potential of Explainable AI (XAI) techniques by proposing an explainable regression framework for the prediction of machines' RUL. We also evaluate several ML algorithms, including classical and Neural Network (NN)-based solutions, for the task. For the explanations, we rely on two model-agnostic XAI methods, namely Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). We believe this work will provide a baseline for future research in the domain.
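    A minimal sketch of the explanation step described above, with a stand-in regressor and synthetic sensor features (the framework's actual models and data differ):
        import numpy as np
        import shap
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 6))                 # stand-in sensor features
        y = 120 - 20 * X[:, 0] + 5 * X[:, 3] + rng.normal(scale=2, size=500)  # fake RUL

        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        explainer = shap.TreeExplainer(model)         # exact SHAP values for tree models
        shap_values = explainer.shap_values(X[:10])
        print(shap_values.shape)                      # (10, 6): per-feature attributions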
    Structured Pruning Learns Compact and Accurate Models. (arXiv:2204.00408v2 [cs.CL] UPDATED)
    The growing size of neural language models has led to increased attention in model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce the model size but hardly achieve speedups as large as distillation does. However, distillation methods require large amounts of unlabeled data and are expensive to train. In this work, we propose a task-specific structured pruning method CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches the distillation methods in both accuracy and latency, without resorting to any unlabeled data. Our key insight is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads and hidden units) modules, which controls the pruning decision of each parameter with masks of different granularity. We also devise a layerwise distillation strategy to transfer knowledge from unpruned to pruned models during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop, showing its effectiveness and efficiency compared to previous pruning and distillation approaches.
    Scan Specific Artifact Reduction in K-space (SPARK) Neural Networks Synergize with Physics-based Reconstruction to Accelerate MRI. (arXiv:2104.01188v3 [eess.SP] UPDATED)
    Purpose: To develop a scan-specific model that estimates and corrects k-space errors made when reconstructing accelerated Magnetic Resonance Imaging (MRI) data. Methods: Scan-Specific Artifact Reduction in k-space (SPARK) trains a convolutional-neural-network to estimate and correct k-space errors made by an input reconstruction technique by back-propagating from the mean-squared-error loss between an auto-calibration signal (ACS) and the input technique's reconstructed ACS. First, SPARK is applied to GRAPPA and demonstrates improved robustness over other scan-specific models, such as RAKI and residual-RAKI. Subsequent experiments demonstrate that SPARK synergizes with residual-RAKI to improve reconstruction performance. SPARK also improves reconstruction quality when applied to advanced acquisition and reconstruction techniques like 2D virtual coil (VC-) GRAPPA, 2D LORAKS, 3D GRAPPA without an integrated ACS region, and 2D/3D wave-encoded images. Results: SPARK yields 1.5x - 2x RMSE reduction when applied to GRAPPA and improves robustness to ACS size for various acceleration rates in comparison to other scan-specific techniques. When applied to advanced reconstruction techniques such as residual-RAKI, 2D VC-GRAPPA and LORAKS, SPARK achieves up to 20% RMSE improvement. SPARK with 3D GRAPPA also improves performance by ~2x and perceived image quality without a fully sampled ACS region. Finally, SPARK synergizes with non-cartesian 2D and 3D wave-encoding imaging by reducing RMSE between 20-25% and providing qualitative improvements. Conclusion: SPARK synergizes with physics-based acquisition and reconstruction techniques to improve accelerated MRI by training scan-specific models to estimate and correct reconstruction errors in k-space.
    Standardized Evaluation of Machine Learning Methods for Evolving Data Streams. (arXiv:2204.13625v1 [cs.LG])
    Due to the unspecified and dynamic nature of data streams, online machine learning requires powerful and flexible solutions. However, evaluating online machine learning methods under realistic conditions is difficult. Existing work therefore often draws on different heuristics and simulations that do not necessarily produce meaningful and reliable results. Indeed, in the absence of common evaluation standards, it often remains unclear how online learning methods will perform in practice or in comparison to similar work. In this paper, we propose a comprehensive set of properties for high-quality machine learning in evolving data streams. In particular, we discuss sensible performance measures and evaluation strategies for online predictive modelling, online feature selection and concept drift detection. As one of the first works, we also look at the interpretability of online learning methods. The proposed evaluation standards are provided in a new Python framework called float. Float is completely modular and allows the simultaneous integration of common libraries, such as scikit-multiflow or river, with custom code. Float is open-sourced and can be accessed at https://github.com/haugjo/float. In this sense, we hope that our work will contribute to more standardized, reliable and realistic testing and comparison of online machine learning methods.
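    A minimal sketch of the prequential (test-then-train) protocol that such evaluation frameworks standardize, written against river's online-learning API with a synthetic stream (the data and model choice are illustrative):
        from river import linear_model, metrics, preprocessing

        model = preprocessing.StandardScaler() | linear_model.LinearRegression()
        metric = metrics.MAE()

        stream = (({"x1": i * 0.1, "x2": float(i % 7)}, 3.0 * i * 0.1 - (i % 7))
                  for i in range(1000))              # stand-in data stream

        for x, y in stream:
            y_pred = model.predict_one(x)            # test first ...
            metric.update(y, y_pred)
            model.learn_one(x, y)                    # ... then train on the same example

        print(metric)                                # running mean absolute error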
    Representative period selection for power system planning using autoencoder-based dimensionality reduction. (arXiv:2204.13608v1 [cs.LG])
    Power sector capacity expansion models (CEMs) that are used for studying future low-carbon grid scenarios must incorporate detailed representation of grid operations. Often CEMs are formulated to model grid operations over representative periods that are sampled from the original input data using clustering algorithms. However, such representative period selection (RPS) methods are limited by the declining efficacy of the clustering algorithm with increasing dimensionality of the input data and do not consider the relative importance of input data variations on CEM outcomes. Here, we propose an RPS method that addresses these limitations by incorporating dimensionality reduction, accomplished via neural network based autoencoders, prior to clustering. Such dimensionality reduction not only improves the performance of the clustering algorithm, but also facilitates using additional features, such as estimated outputs produced from parallel solutions of simplified versions of the CEM for each disjoint period in the input data (e.g. 1 week). The impact of incorporating dimensionality reduction as part of RPS methods is quantified through the error in outcomes of the corresponding reduced-space CEM vs. the full-space CEM. Extensive numerical experimentation across various networks and a range of technology and policy scenarios establish the superiority of the dimensionality-reduction based RPS methods.
    Toward Compositional Generalization in Object-Oriented World Modeling. (arXiv:2204.13661v1 [cs.LG])
    Compositional generalization is a critical ability in learning and decision-making. We focus on the setting of reinforcement learning in object-oriented environments to study compositional generalization in world modeling. We (1) formalize the compositional generalization problem with an algebraic approach and (2) study how a world model can achieve that. We introduce a conceptual environment, Object Library, and two instances, and deploy a principled pipeline to measure the generalization ability. Motivated by the formulation, we analyze several methods with exact or no compositional generalization ability using our framework, and design a differentiable approach, Homomorphic Object-oriented World Model (HOWM), that achieves approximate but more efficient compositional generalization.
    DOTIN: Dropping Task-Irrelevant Nodes for GNNs. (arXiv:2204.13429v1 [cs.LG])
    Scalability is an important consideration for deep graph neural networks. Inspired by the conventional pooling layers in CNNs, many recent graph learning approaches have introduced the pooling strategy to reduce the size of graphs for learning, such that the scalability and efficiency can be improved. However, these pooling-based methods are mainly tailored to a single graph-level task and pay more attention to local information, limiting their performance in multi-task settings which often require task-specific global information. In this paper, departing from these pooling-based efforts, we design a new approach called DOTIN (\underline{D}r\underline{o}pping \underline{T}ask-\underline{I}rrelevant \underline{N}odes) to reduce the size of graphs. Specifically, by introducing $K$ learnable virtual nodes to represent the graph embeddings targeted to $K$ different graph-level tasks, respectively, up to 90\% of raw nodes with low attentiveness under an attention model -- a transformer in this paper -- can be adaptively dropped without notable performance degradation. Achieving almost the same accuracy, our method speeds up GAT by about 50\% on graph-level tasks including graph classification and graph edit distance (GED), with about 60\% less memory, on the D\&D dataset. Code will be made publicly available at https://github.com/Sherrylone/DOTIN.
    Prescriptive and Descriptive Approaches to Machine-Learning Transparency. (arXiv:2204.13582v1 [cs.SE])
    Specialized documentation techniques have been developed to communicate key facts about machine-learning (ML) systems and the datasets and models they rely on. Techniques such as Datasheets, FactSheets, and Model Cards have taken a mainly descriptive approach, providing various details about the system components. While the above information is essential for product developers and external experts to assess whether the ML system meets their requirements, other stakeholders might find it less actionable. In particular, ML engineers need guidance on how to mitigate potential shortcomings in order to fix bugs or improve the system's performance. We survey approaches that aim to provide such guidance in a prescriptive way. We further propose a preliminary approach, called Method Cards, which aims to increase the transparency and reproducibility of ML systems by providing prescriptive documentation of commonly-used ML methods and techniques. We showcase our proposal with an example in small object detection, and demonstrate how Method Cards can communicate key considerations for model developers. We further highlight avenues for improving the user experience of ML engineers based on Method Cards.
    Federated Learning on Heterogeneous and Long-Tailed Data via Classifier Re-Training with Federated Features. (arXiv:2204.13399v1 [cs.LG])
    Federated learning (FL) provides a privacy-preserving solution for distributed machine learning tasks. One challenging problem that severely damages the performance of FL models is the co-occurrence of data heterogeneity and long-tail distribution, which frequently appears in real FL applications. In this paper, we reveal an intriguing fact that the biased classifier is the primary factor leading to the poor performance of the global model. Motivated by the above finding, we propose a novel and privacy-preserving FL method for heterogeneous and long-tailed data via Classifier Re-training with Federated Features (CReFF). The classifier re-trained on federated features can deliver performance comparable to one re-trained on real data, in a privacy-preserving manner and without information leakage of local data or class distribution. Experiments on several benchmark datasets show that the proposed CReFF is an effective solution to obtain a promising FL model under heterogeneous and long-tailed data. Comparative results with state-of-the-art FL methods also validate the superiority of CReFF. Our code is available at https://github.com/shangxinyi/CReFF-FL.
    Worst-Case Dynamic Power Distribution Network Noise Prediction Using Convolutional Neural Network. (arXiv:2204.13109v1 [cs.LG])
    Worst-case dynamic PDN noise analysis is an essential step in PDN sign-off to ensure the performance and reliability of chips. However, with the growing PDN size and increasing scenarios to be validated, it becomes very time- and resource-consuming to conduct full-stack PDN simulation to check the worst-case noise for different test vectors. Recently, various works have proposed machine learning based methods for supply noise prediction, many of which still suffer from large training overhead, inefficiency, or non-scalability. Thus, this paper proposes an efficient and scalable framework for worst-case dynamic PDN noise prediction. The framework first reduces the spatial and temporal redundancy in the PDN and input current vector, and then employs efficient feature extraction as well as a novel convolutional neural network architecture to predict the worst-case dynamic PDN noise. Experimental results show that the proposed framework consistently outperforms the commercial tool and the state-of-the-art machine learning method with only 0.63-1.02% mean relative error and 25-69$\times$ speedup.
    Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework. (arXiv:2204.13207v1 [cs.CV])
    Current contrastive learning frameworks focus on leveraging a single supervisory signal to learn representations, which limits the efficacy on unseen data and downstream tasks. In this paper, we present a hierarchical multi-label representation learning framework that can leverage all available labels and preserve the hierarchical relationship between classes. We introduce novel hierarchy preserving losses, which jointly apply a hierarchical penalty to the contrastive loss, and enforce the hierarchy constraint. The loss function is data driven and automatically adapts to arbitrary multi-label structures. Experiments on several datasets show that our relationship-preserving embedding performs well on a variety of tasks and outperforms baseline supervised and self-supervised approaches. Code is available at https://github.com/salesforce/hierarchicalContrastiveLearning.
    Neural network controllers for uncertain linear systems. (arXiv:2204.13209v1 [eess.SY])
    We consider the design of reliable neural network (NN)-based approximations of traditional stabilizing controllers for linear systems affected by polytopic uncertainty, including controllers with variable structure and those based on a minimal selection policy. We develop a systematic procedure to certify the closed-loop stability and performance of a polytopic system when a rectified linear unit (ReLU)-based approximation replaces such traditional controllers. We provide sufficient conditions to ensure stability involving the worst-case approximation error and the Lipschitz constant characterizing the error function between ReLU-based and traditional controller-based state-to-input mappings, and further provide offline, mixed-integer optimization-based methods that allow us to compute those quantities exactly.
    Epileptic Seizure Classification Using Combined Labels and a Genetic Algorithm. (arXiv:2110.01742v4 [eess.SP] UPDATED)
    Epilepsy affects 50 million people worldwide and is one of the most common serious neurological disorders. Seizure detection and classification is a valuable tool for diagnosing and maintaining the condition. An automated classification algorithm will allow for accurate diagnosis. Utilising the Temple University Hospital (TUH) Seizure Corpus, six seizure types are compared: absence, complex partial, myoclonic, simple partial, tonic and tonic-clonic. This study proposes a method that utilises unique features with a novel parallel classifier - Parallel Genetic Naive Bayes (NB) Seizure Classifier (PGNBSC). The PGNBSC algorithm searches through the features and, by reclassifying the data each time, creates a matrix of optimum search criteria. Ictal states from the EEGs are segmented into 1.8 s windows, where the epochs are then further decomposed into 13 different features from the first intrinsic mode function (IMF). The features are compared using an original NB classifier in the first model. This is improved upon in a second model by using a genetic algorithm (Binary Grey Wolf Optimisation, Option 1) with an NB classifier. The third model uses a combination of the simple partial and complex partial seizures to provide the highest classification accuracy for each of the six seizures amongst the three models (20%, 53%, and 85% for the first, second, and third model, respectively).
    A Decision Model for Federated Learning Architecture Pattern Selection. (arXiv:2204.13291v1 [cs.LG])
    Federated learning is growing fast in both academia and industry to resolve data hungriness and privacy issues in machine learning. A federated learning system, being widely distributed with different components and stakeholders, requires software system design thinking. For instance, multiple patterns and tactics have been summarised by researchers, covering various aspects from client management and training configuration to model deployment. However, the multitude of patterns leaves the designers confused about when and which pattern to adopt or adapt. Therefore, in this paper, we present a set of decision models to assist designers and architects who have limited knowledge of federated learning in selecting architectural patterns for federated learning architecture design. Each decision model maps functional and non-functional requirements of federated learning systems to a set of patterns. We also clarify the trade-offs that may be implicit in the patterns. We evaluated the decision models through a set of interviews with practitioners to assess their correctness and usefulness in guiding the architecture design process through various design decision options.
    Efficient-VDVAE: Less is more. (arXiv:2203.13751v2 [cs.LG] UPDATED)
    Hierarchical VAEs have emerged in recent years as a reliable option for maximum likelihood estimation. However, instability issues and demanding computational requirements have hindered research progress in the area. We present simple modifications to the Very Deep VAE to make it converge up to $2.6\times$ faster, save up to $20\times$ in memory load and improve stability during training. Despite these changes, our models achieve comparable or better negative log-likelihood performance than current state-of-the-art models on all $7$ commonly used image datasets we evaluated on. We also make an argument against using 5-bit benchmarks as a way to measure hierarchical VAE's performance due to undesirable biases caused by the 5-bit quantization. Additionally, we empirically demonstrate that roughly $3\%$ of the hierarchical VAE's latent space dimensions is sufficient to encode most of the image information, without loss of performance, opening up the doors to efficiently leverage the hierarchical VAEs' latent space in downstream tasks. We release our source code and models at https://github.com/Rayhane-mamah/Efficient-VDVAE .
    Attention Mechanism in Neural Networks: Where it Comes and Where it Goes. (arXiv:2204.13154v1 [cs.LG])
    A long time ago in the machine learning literature, the idea of incorporating a mechanism inspired by the human visual system into neural networks was introduced. This idea is named the attention mechanism, and it has gone through a long development period. Today, many works have been devoted to this idea in a variety of tasks, and remarkable performance has recently been demonstrated. The goal of this paper is to provide an overview from the early work on searching for ways to implement the attention idea with neural networks until the recent trends. This review emphasizes the important milestones of this progress across different tasks. In this way, this study aims to provide a road map for researchers to explore the current development and get inspired by novel approaches beyond attention.
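    As a reference point for the mechanisms surveyed, a minimal sketch of scaled dot-product attention, the core computation most modern attention variants share:
        import numpy as np

        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
        def attention(Q, K, V):
            scores = Q @ K.T / np.sqrt(Q.shape[-1])     # query-key similarities
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)          # each row sums to 1
            return w @ V                                # weighted average of values

        rng = np.random.default_rng(0)
        Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
        print(attention(Q, K, V).shape)                 # (5, 16)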
    Multi-Agent Reinforcement Learning for Network Load Balancing in Data Center. (arXiv:2201.11727v3 [cs.DC] UPDATED)
    This paper presents the network load balancing problem, a challenging real-world task for multi-agent reinforcement learning (MARL) methods. Traditional heuristic solutions like Weighted-Cost Multi-Path (WCMP) and Local Shortest Queue (LSQ) are less flexible to the changing workload distributions and arrival rates, with a poor balance among multiple load balancers. The cooperative network load balancing task is formulated as a Dec-POMDP problem, which naturally induces the MARL methods. To bridge the reality gap for applying learning-based methods, all methods are directly trained and evaluated on an emulation system, from moderate to large scale. Experiments on realistic testbeds show that the independent and "selfish" load balancing strategies are not necessarily the globally optimal ones, while the proposed MARL solution has superior performance over different realistic settings. Additionally, the potential difficulties of MARL methods for network load balancing are analysed, which helps to draw the attention of the learning and network communities to such challenges.
    DIANES: A DEI Audit Toolkit for News Sources. (arXiv:2203.11383v2 [cs.IR] UPDATED)
    Professional news media organizations have always touted the importance that they give to multiple perspectives. However, in practice, the traditional approach to all-sides has favored people in the dominant culture. Hence it has come under ethical critique under the new norms of diversity, equity, and inclusion (DEI). When DEI is applied to journalism, it goes beyond conventional notions of impartiality and bias and instead democratizes the journalistic practice of sourcing -- who is quoted or interviewed, who is not, how often, from which demographic group, gender, and so forth. There is currently no real-time or on-demand tool in the hands of reporters to analyze the persons they quote. In this paper, we present DIANES, a DEI Audit Toolkit for News Sources. It consists of a natural language processing pipeline on the backend to extract quotes, speakers, titles, and organizations from news articles in real time. On the frontend, DIANES offers WordPress plugins, a Web monitor, and a DEI annotation API service, to help news media monitor their own quoting patterns and push themselves towards DEI norms.
    Pedagogical Rule Extraction to Learn Interpretable Models - an Empirical Study. (arXiv:2112.13285v2 [cs.LG] UPDATED)
    Machine-learning models are ubiquitous. In some domains, for instance in medicine, the models' predictions must be interpretable. Decision trees, classification rules, and subgroup discovery are three broad categories of supervised machine-learning models presenting knowledge in the form of interpretable rules. The accuracy of these models learned from small datasets is usually low. Obtaining larger datasets is often hard or even impossible. Pedagogical rule extraction methods could help to learn better rules from small data by augmenting a dataset using statistical models and then using it to learn a rule-based model. However, the existing evaluation of these methods is often inconclusive, and they have not been compared so far. Our framework PRELIM unifies existing pedagogical rule extraction techniques. In extensive experiments, we identified promising PRELIM configurations not studied before.
    Continual learning-based probabilistic slow feature analysis for multimode dynamic process monitoring. (arXiv:2202.11295v2 [cs.LG] UPDATED)
    In this paper, a novel multimode dynamic process monitoring approach is proposed by extending elastic weight consolidation (EWC) to probabilistic slow feature analysis (PSFA) in order to extract multimode slow features for online monitoring. EWC was originally introduced in the setting of machine learning of sequential multi-tasks with the aim of avoiding the catastrophic forgetting issue, which equally poses a major challenge in multimode dynamic process monitoring. When a new mode arrives, a set of data should be collected so that this mode can be identified by PSFA and prior knowledge. Then, a regularization term is introduced to prevent new data from significantly interfering with the learned knowledge, where the parameter importance measures are estimated. The proposed method is denoted as PSFA-EWC, which is updated continually and capable of achieving excellent performance for successive modes. Different from traditional multimode monitoring algorithms, PSFA-EWC furnishes backward and forward transfer ability. The significant features of previous modes are retained while consolidating new information, which may contribute to learning new relevant modes. Compared with several known methods, the effectiveness of the proposed method is demonstrated via a continuous stirred tank heater and a practical coal pulverizing system.
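    For reference, the standard EWC-regularized objective that PSFA-EWC builds on (notation ours) penalizes drift from the previous mode's parameters $\theta^*$ in proportion to each parameter's estimated importance $F_i$: $\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$.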
    Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products. (arXiv:2202.05120v2 [cs.DS] UPDATED)
    We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ through matrix-vector products, an accuracy parameter $\epsilon$, and a target rank $k$, the goal is to find a rank-$k$ matrix $Z$ with orthonormal columns such that $\| A(I -ZZ^\top)\|_{S_p} \leq (1+\epsilon)\min_{U^\top U = I_k} \|A(I - U U^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the singular values of $M$. For the special cases of $p=2$ (Frobenius norm) and $p = \infty$ (Spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products, improving on the na\"ive $\tilde{O}(k/\epsilon)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/\epsilon))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/\epsilon^{1/3})$ matrix-vector products, and works for all $p \geq 1$. For $p = 2$ our bound improves the previous $\tilde{O}(k/\epsilon^{1/2})$ bound to $\tilde{O}(k/\epsilon^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $1+ \epsilon$ factor when $p \geq (\log d)/\epsilon$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $\Omega(1/\epsilon^{1/3})$ for any fixed constant $p \geq 1$, showing that surprisingly $\tilde{\Theta}(1/\epsilon^{1/3})$ is the optimal complexity for constant~$k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously, and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p>2$, we appeal to a norm-compression inequality for aligned partitioned operators.
    Experimental quantum advantage with quantum coupon collector. (arXiv:2112.07884v2 [quant-ph] UPDATED)
    An increasing number of communication and computational schemes with quantum advantages have recently been proposed, which implies that quantum technology has fertile application prospects. However, demonstrating these schemes experimentally continues to be a central challenge because of the difficulty in preparing high-dimensional states or highly entangled states. In this study, we introduce and analyse a quantum coupon collector protocol by employing coherent states and simple linear optical elements, which was successfully demonstrated using realistic experimental equipment. We showed that our protocol can significantly reduce the number of samples needed to learn a specific set compared with the classical limit of the coupon collector problem. We also discuss the potential values and expansions of the quantum coupon collector by constructing a quantum blind box game. The information transmitted by the proposed game also broke the classical limit. These results strongly prove the advantages of quantum mechanics in machine learning and communication complexity.
    On Exploiting Layerwise Gradient Statistics for Effective Training of Deep Neural Networks. (arXiv:2203.13273v3 [cs.LG] UPDATED)
    Adam and AdaBelief compute and make use of elementwise adaptive stepsizes in training deep neural networks (DNNs) by tracking the exponential moving average (EMA) of the squared-gradient g_t^2 and the squared prediction error (m_t-g_t)^2, respectively, where m_t is the first momentum at iteration t and can be viewed as a prediction of g_t. In this work, we investigate if layerwise gradient statistics can be exploited in Adam and AdaBelief to allow for more effective training of DNNs. We address the above research question in two steps. Firstly, we slightly modify Adam and AdaBelief by introducing layerwise adaptive stepsizes in their update procedures via either pre- or post-processing. Our empirical results indicate that the slight modification produces comparable performance for training VGG and ResNet models over CIFAR10 and CIFAR100, suggesting that layer-wise gradient statistics play an important role towards the success of Adam and AdaBelief for at least certain DNN tasks. In the second step, we propose Aida, a new optimisation method, with the objective that the elementwise stepsizes within each layer have significantly smaller statistical variances, and the layerwise average stepsizes are much more compact across all the layers. Motivated by the fact that (m_t-g_t)^2 in AdaBelief is conservative in comparison to g_t^2 in Adam in terms of layerwise statistical averages and variances, Aida is designed by tracking a more conservative function of m_t and g_t than (m_t-g_t)^2 via layerwise vector projections. Experimental results show that Aida produces either competitive or better performance with respect to a number of existing methods including Adam and AdaBelief for a set of challenging DNN tasks.
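    A minimal sketch of the elementwise statistics being contrasted (bias correction and stepsize details omitted; the synthetic gradients are illustrative): Adam tracks an EMA of g_t^2, AdaBelief an EMA of (m_t-g_t)^2, and a layerwise variant collapses these to per-layer scalars:
        import numpy as np

        def ema_stats(grads, beta1=0.9, beta2=0.999):
            m = np.zeros_like(grads[0])
            v_adam = np.zeros_like(grads[0])
            v_belief = np.zeros_like(grads[0])
            for g in grads:
                m = beta1 * m + (1 - beta1) * g                       # prediction of g
                v_adam = beta2 * v_adam + (1 - beta2) * g ** 2        # Adam statistic
                v_belief = beta2 * v_belief + (1 - beta2) * (m - g) ** 2  # AdaBelief statistic
            return v_adam, v_belief

        grads = [np.random.default_rng(i).normal(size=100) for i in range(50)]
        v_a, v_b = ema_stats(grads)
        print(v_a.mean(), v_b.mean())   # layerwise averages of the two statistics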
    Globally Optimal Hierarchical Reinforcement Learning for Linearly-Solvable Markov Decision Processes. (arXiv:2106.15380v3 [cs.LG] UPDATED)
    In this work we present a novel approach to hierarchical reinforcement learning for linearly-solvable Markov decision processes. Our approach assumes that the state space is partitioned, and the subtasks consist in moving between the partitions. We represent value functions on several levels of abstraction, and use the compositionality of subtasks to estimate the optimal values of the states in each partition. The policy is implicitly defined on these optimal value estimates, rather than being decomposed among the subtasks. As a consequence, our approach can learn the globally optimal policy, and does not suffer from the non-stationarity of high-level decisions. If several partitions have equivalent dynamics, the subtasks of those partitions can be shared. If the set of boundary states is smaller than the entire state space, our approach can have significantly smaller sample complexity than that of a flat learner, and we validate this empirically in several experiments.
    Multiplicative Updates for NMF with $\beta$-Divergences under Disjoint Equality Constraints. (arXiv:2010.16223v2 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) is the problem of approximating an input nonnegative matrix, $V$, as the product of two smaller nonnegative matrices, $W$ and $H$. In this paper, we introduce a general framework to design multiplicative updates (MU) for NMF based on $\beta$-divergences ($\beta$-NMF) with disjoint equality constraints, and with penalty terms in the objective function. By disjoint, we mean that each variable appears in at most one equality constraint. Our MU satisfy the set of constraints after each update of the variables during the optimization process, while guaranteeing that the objective function decreases monotonically. We showcase this framework on three NMF models, and show that it competes favorably with the state of the art: (1)~$\beta$-NMF with sum-to-one constraints on the columns of $H$, (2) minimum-volume $\beta$-NMF with sum-to-one constraints on the columns of $W$, and (3) sparse $\beta$-NMF with $\ell_2$-norm constraints on the columns of $W$.
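    For reference, the unconstrained $\beta$-NMF multiplicative update for $H$ from which such schemes are derived (the constrained updates in the paper modify this form) is $H \leftarrow H \circ \left(W^\top\left[(WH)^{\cdot(\beta-2)} \circ V\right]\right) / \left(W^\top (WH)^{\cdot(\beta-1)}\right)$, where $\circ$, the division and the dotted powers are entrywise; the update for $W$ is symmetric.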
    Efficient Geometry-aware 3D Generative Adversarial Networks. (arXiv:2112.07945v2 [cs.CV] UPDATED)
    Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.
    On Evaluation Metrics for Graph Generative Models. (arXiv:2201.09871v2 [cs.LG] UPDATED)
    In image generation, generative models can be evaluated naturally by visually inspecting model outputs. However, this is not always the case for graph generative models (GGMs), making their evaluation challenging. Currently, the standard process for evaluating GGMs suffers from three critical limitations: i) it does not produce a single score which makes model selection challenging, ii) in many cases it fails to consider underlying edge and node features, and iii) it is prohibitively slow to perform. In this work, we mitigate these issues by searching for scalar, domain-agnostic, and scalable metrics for evaluating and ranking GGMs. To this end, we study existing GGM metrics and neural-network-based metrics emerging from generative models of images that use embeddings extracted from a task-specific network. Motivated by the power of certain Graph Neural Networks (GNNs) to extract meaningful graph representations without any training, we introduce several metrics based on the features extracted by an untrained random GNN. We design experiments to thoroughly test metrics on their ability to measure the diversity and fidelity of generated graphs, as well as their sample and computational efficiency. Depending on the quantity of samples, we recommend one of two random-GNN-based metrics that we show to be more expressive than pre-existing metrics. While we focus on applying these metrics to GGM evaluation, in practice this enables the ability to easily compute the dissimilarity between any two sets of graphs regardless of domain. Our code is released at: https://github.com/uoguelph-mlrg/GGM-metrics.
    Trivial or impossible -- dichotomous data difficulty masks model differences (on ImageNet and beyond). (arXiv:2110.05922v3 [cs.CV] UPDATED)
    "The power of a generalization system follows directly from its biases" (Mitchell 1980). Today, CNNs are incredibly powerful generalisation systems -- but to what degree have we understood how their inductive bias influences model decisions? We here attempt to disentangle the various aspects that determine how a model decides. In particular, we ask: what makes one model decide differently from another? In a meticulously controlled setting, we find that (1.) irrespective of the network architecture or objective (e.g. self-supervised, semi-supervised, vision transformers, recurrent models) all models end up with a similar decision boundary. (2.) To understand these findings, we analysed model decisions on the ImageNet validation set from epoch to epoch and image by image. We find that the ImageNet validation set, among others, suffers from dichotomous data difficulty (DDD): For the range of investigated models and their accuracies, it is dominated by 46.0% "trivial" and 11.5% "impossible" images (beyond label errors). Only 42.5% of the images could possibly be responsible for the differences between two models' decision boundaries. (3.) Only removing the "impossible" and "trivial" images allows us to see pronounced differences between models. (4.) Humans are highly accurate at predicting which images are "trivial" and "impossible" for CNNs (81.4%). This implies that in future comparisons of brains, machines and behaviour, much may be gained from investigating the decisive role of images and the distribution of their difficulties.
    Variational Inference with NoFAS: Normalizing Flow with Adaptive Surrogate for Computationally Expensive Models. (arXiv:2108.12657v2 [cs.LG] UPDATED)
    Fast inference of numerical model parameters from data is an important prerequisite to generate predictive models for a wide range of applications. Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive. New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space, and rely on gradient-based optimization instead of sampling, providing a more efficient approach for Bayesian inference about the model parameters. Moreover, the cost of frequently evaluating an expensive likelihood can be mitigated by replacing the true model with an offline trained surrogate model, such as neural networks. However, this approach might generate significant bias when the surrogate is insufficiently accurate around the posterior modes. To reduce the computational cost without sacrificing inferential accuracy, we propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternately updates the normalizing flow parameters and the surrogate model parameters. We also propose an efficient sample weighting scheme for surrogate model training that preserves global accuracy while effectively capturing high posterior density regions. We demonstrate the inferential and computational superiority of NoFAS against various benchmarks, including cases where the underlying model lacks identifiability. The source code and numerical experiments used for this study are available at https://github.com/cedricwangyu/NoFAS.
    Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control. (arXiv:2110.01052v4 [cs.LG] UPDATED)
    We introduce a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees. Our calibration algorithm works with any underlying model and (unknown) data-generating distribution and does not require model refitting. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision and tabular medical data.
    On Expressivity and Trainability of Quadratic Networks. (arXiv:2110.06081v2 [cs.LG] UPDATED)
    Inspired by the diversity of biological neurons, quadratic artificial neurons can play an important role in deep learning models. The type of quadratic neurons of our interest replaces the inner-product operation in the conventional neuron with a quadratic function. Despite promising results so far achieved by networks of quadratic neurons, there are important issues not well addressed. Theoretically, the superior expressivity of a quadratic network over either a conventional network or a conventional network via quadratic activation is not fully elucidated, which makes the use of quadratic networks not well grounded. Practically, although a quadratic network can be trained via generic backpropagation, it can be subject to a higher risk of collapse than the conventional counterpart. To address these issues, we first apply the spline theory and a measure from algebraic geometry to give two theorems that demonstrate better model expressivity of a quadratic network than the conventional counterpart with or without quadratic activation. Then, we propose an effective and efficient training strategy referred to as ReLinear to stabilize the training process of a quadratic network, thereby unleashing the full potential in its associated machine learning tasks. Comprehensive experiments on popular datasets are performed to support our findings and evaluate the performance of quadratic deep learning.
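    A minimal sketch of one common quadratic-neuron parameterization from this literature (the paper's exact form may differ; all weights are illustrative), replacing the inner product with a product of two affine maps plus a power term:
        import numpy as np

        rng = np.random.default_rng(0)
        d = 8
        w1, w2, w3 = (rng.normal(size=d) * 0.1 for _ in range(3))
        b1, b2, b3 = 0.0, 1.0, 0.0

        def quadratic_neuron(x):
            # (w1.x + b1)(w2.x + b2) + w3.(x*x) + b3, then a ReLU nonlinearity
            pre = (w1 @ x + b1) * (w2 @ x + b2) + w3 @ (x * x) + b3
            return max(pre, 0.0)

        print(quadratic_neuron(rng.normal(size=d)))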
    Artificial Text Detection via Examining the Topology of Attention Maps. (arXiv:2109.04825v2 [cs.CL] UPDATED)
    The impressive capabilities of recent generative models to create texts that are challenging to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10\% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models as opposed to existing methods. The probing analysis of the features reveals their sensitivity to surface and syntactic properties. The results demonstrate that TDA is a promising line of research for NLP tasks, specifically the ones that incorporate surface and structural information.
    SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. (arXiv:2004.13316v2 [cs.CV] UPDATED)
    Small and cluttered objects are common in the real world and are challenging to detect. The difficulty is further pronounced when the objects are rotated, as traditional detectors often routinely locate the objects in horizontal bounding boxes, such that the region of interest is contaminated with background or nearby interleaved objects. In this paper, we first innovatively introduce the idea of denoising to object detection. Instance-level denoising on the feature map is performed to enhance the detection of small and cluttered objects. To handle the rotation variation, we also add a novel IoU constant factor to the smooth L1 loss to address the long-standing boundary problem, which, by our analysis, is mainly caused by the periodicity of angle (PoA) and exchangeability of edges (EoE). By combining these two features, our proposed detector is termed SCRDet++. Extensive experiments are performed on the large aerial image public datasets DOTA, DIOR and UCAS-AOD, as well as the natural image dataset COCO, the scene text dataset ICDAR2015, the small traffic light dataset BSTLD and our released S$^2$TLD. The results show the effectiveness of our approach. The released dataset S$^2$TLD is made publicly available; it contains 5,786 images with 14,130 traffic light instances across five categories.
    Policy Gradient Stock GAN for Realistic Discrete Order Data Generation in Financial Markets. (arXiv:2204.13338v1 [cs.LG])
    This study proposes a new generative adversarial network (GAN) for generating realistic orders in financial markets. Some previous GANs for financial markets generated fake orders in continuous spaces because of the learning limitations of GAN architectures. In reality, however, orders are discrete: order prices have a minimum price unit, and order types are categorical. Thus, in this study we change the generation method to place the generated fake orders into discrete spaces. Because this change disables the ordinary GAN learning algorithm, we employ a policy gradient, frequently used in reinforcement learning, as the learning algorithm. Through our experiments, we show that our proposed model outperforms previous models in the generated order distribution. As an additional benefit of introducing the policy gradient, the entropy of the generated policy can be used to check the GAN's learning status. In the future, higher-performance GANs, better evaluation methods, or applications of our GANs can be addressed.
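    A minimal sketch of the policy-gradient (REINFORCE) mechanics involved, with a softmax generator over discrete price levels and a fixed stand-in reward in place of the adversarially trained discriminator (levels, reward values and learning rate are illustrative):
        import numpy as np

        rng = np.random.default_rng(0)
        n_levels = 10
        logits = np.zeros(n_levels)                 # generator parameters
        reward = np.log(np.array([.05, .1, .2, .3, .2, .1, .03, .01, .005, .005]))

        for _ in range(3000):
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            level = rng.choice(n_levels, p=probs)   # sample a discrete order price
            grad_logp = -probs
            grad_logp[level] += 1.0                 # d log pi(level) / d logits
            logits += 0.05 * reward[level] * grad_logp   # REINFORCE ascent step

        # Mass concentrates on highly rewarded levels; a real GAN would instead
        # train the discriminator adversarially and add baseline/entropy terms.
        print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))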
    Machine learning on DNA-encoded library count data using an uncertainty-aware probabilistic loss function. (arXiv:2108.12471v2 [q-bio.QM] UPDATED)
    DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find small molecules that bind a protein target. Applying QSAR modeling to DEL data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been shown recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" to accommodate the sparse and noisy nature of DEL data. However, a binary classifier cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules using a custom negative log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships (SAR). Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a dataset of 108k compounds screened against CAIX, and a dataset of 5.7M compounds screened against sEH and SIRT2. Due to the treatment of uncertainty in the data through the negative log-likelihood loss function, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying SAR trends and enriched pharmacophores in DEL data. Further, this approach to uncertainty-aware regression is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
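    A minimal sketch of a Poisson count loss of this kind (an illustrative parameterization, not the authors' exact likelihood): the predicted enrichment scales the expected count in the selection condition, so larger observed counts support larger enrichments:
        import numpy as np

        def poisson_nll(k, lam):
            return lam - k * np.log(lam + 1e-12)    # -log P(k | lam), up to log(k!)

        def del_loss(r_pred, k1, k2, n1, n2, p=1e-6):
            # counts: k1 ~ Poisson(n1 * p) pre-selection, k2 ~ Poisson(n2 * r * p)
            return poisson_nll(k1, n1 * p) + poisson_nll(k2, n2 * r_pred * p)

        rs = np.linspace(0.5, 50, 500)
        for k2 in (4, 40):                          # two compounds with the same k1
            losses = [del_loss(r, k1=2, k2=k2, n1=2e6, n2=2e6) for r in rs]
            print(k2, "-> best enrichment ~", round(rs[int(np.argmin(losses))], 1))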
    Predicting S&P500 Index direction with Transfer Learning and a Causal Graph as main Input. (arXiv:2011.13113v3 [q-fin.ST] UPDATED)
    We propose a unified multi-tasking framework to represent the complex and uncertain causal process of financial market dynamics, and then to predict the movement of any type of index, with an application to the monthly direction of the S&P500 index. Our solution is based on three main pillars: (i) the use of transfer learning to share knowledge and feature representations across all financial markets, increase the size of the training sample, and preserve the stability between training, validation and test samples. (ii) The combination of multidisciplinary knowledge (financial economics, behavioral finance, market microstructure and portfolio construction theories) to represent a global top-down dynamics of any financial market through a graph. (iii) The integration of forward-looking unstructured data and different types of contexts (long, medium and short term) through latent variables/nodes, using a single VAE network (parameter sharing) to learn their distributional representations simultaneously. We obtain Accuracy, F1-score, and Matthews Correlation of 74.3%, 67% and 0.42, above the industry and other benchmarks, over a 12-year test period which includes three unstable sub-periods that are difficult to predict.
    Bilinear value networks. (arXiv:2204.13695v1 [cs.AI])
    The dominant framework for off-policy multi-goal reinforcement learning involves estimating goal conditioned Q-value function. When learning to achieve multiple goals, data efficiency is intimately connected with the generalization of the Q-function to new goals. The de-facto paradigm is to approximate Q(s, a, g) using monolithic neural networks. To improve the generalization of the Q-function, we propose a bilinear decomposition that represents the Q-value via a low-rank approximation in the form of a dot product between two vector fields. The first vector field, f(s, a), captures the environment's local dynamics at the state s; whereas the second component, {\phi}(s, g), captures the global relationship between the current state and the goal. We show that our bilinear decomposition scheme substantially improves data efficiency, and has superior transfer to out-of-distribution goals compared to prior methods. Empirical evidence is provided on the simulated Fetch robot task-suite and dexterous manipulation with a Shadow hand.
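    A minimal sketch of the described bilinear decomposition, with illustrative dimensions and layer sizes (the paper's exact architecture may differ):

        import torch
        import torch.nn as nn

        class BilinearQ(nn.Module):
            def __init__(self, s_dim, a_dim, g_dim, k=64):
                super().__init__()
                # f(s, a): local-dynamics embedding; phi(s, g): state-goal embedding
                self.f = nn.Sequential(nn.Linear(s_dim + a_dim, 128), nn.ReLU(), nn.Linear(128, k))
                self.phi = nn.Sequential(nn.Linear(s_dim + g_dim, 128), nn.ReLU(), nn.Linear(128, k))

            def forward(self, s, a, g):
                fa = self.f(torch.cat([s, a], dim=-1))
                pg = self.phi(torch.cat([s, g], dim=-1))
                return (fa * pg).sum(dim=-1)  # dot product -> low-rank Q-value

        q = BilinearQ(s_dim=10, a_dim=4, g_dim=3)
        s, a, g = torch.randn(32, 10), torch.randn(32, 4), torch.randn(32, 3)
        print(q(s, a, g).shape)  # torch.Size([32])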
    AlphaZero-Inspired General Board Game Learning and Playing. (arXiv:2204.13307v1 [cs.LG])
    Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning. While the achievements of AlphaGo and AlphaZero - playing Go and other complex games at superhuman level - are truly impressive, these architectures have the drawback that they are very complex and require high computational resources. Many researchers are looking for methods that are similar to AlphaZero, but have lower computational demands and are thus more easily reproducible. In this paper, we pick an important element of AlphaZero - the Monte Carlo Tree Search (MCTS) planning stage - and combine it with reinforcement learning (RL) agents. We wrap MCTS for the first time around RL n-tuple networks to create versatile agents while keeping the computational demands low. We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper. In particular, we present results showing that this AlphaZero-inspired agent is the first one trained on standard hardware (no GPU or TPU) to beat the very strong Othello program Edax up to and including level 7 (where most other algorithms could only defeat Edax up to level 2).
    Exchangeability-Aware Sum-Product Networks. (arXiv:2110.05165v2 [cs.LG] UPDATED)
    Sum-Product Networks (SPNs) are expressive probabilistic models that provide exact, tractable inference. They achieve this efficiency by making use of local independence. On the other hand, mixtures of exchangeable variable models (MEVMs) are a class of tractable probabilistic models that make use of exchangeability of discrete random variables to render inference tractable. Exchangeability, which arises naturally in relational domains, has not been considered for efficient representation and inference in SPNs yet. The contribution of this paper is a novel probabilistic model which we call Exchangeability-Aware Sum-Product Networks (XSPNs). It contains both SPNs and MEVMs as special cases, and combines the ability of SPNs to efficiently learn deep probabilistic models with the ability of MEVMs to efficiently handle exchangeable random variables. We introduce a structure learning algorithm for XSPNs and empirically show that they can be more accurate than conventional SPNs when the data contains repeated, interchangeable parts.
    Supervised machine learning classification for short straddles on the S&P500. (arXiv:2204.13587v1 [q-fin.CP])
    In this working paper we present our current progress in the training of machine learning models to execute short option strategies on the S&P500. As a first step, we break this problem down into a supervised classification task that decides, on a daily basis, whether a short straddle on the S&P500 should be executed. We describe the framework we use and present an overview of our evaluation metrics for different classification models. In this preliminary work, using standard machine learning techniques and without hyperparameter search, we find no statistically significant outperformance over a simple "trade always" strategy, but we gain additional insights on how to proceed in further experiments.
    Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models: Extension. (arXiv:1905.10395v5 [cs.LG] UPDATED)
    We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines. This work is a gentle extension of [2].
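    A toy sketch of the core two-force update (regular gradient step plus attraction toward the leader); selecting the leader by current loss and the fixed coupling coefficient lam are illustrative simplifications of the algorithm, not its full specification:

        import torch

        def lsgd_step(workers, losses, grads, lr=0.01, lam=0.1):
            # workers: list of parameter tensors; losses/grads aligned with workers
            best = min(range(len(workers)), key=lambda i: losses[i])
            leader = workers[best].clone()            # snapshot the best performer
            for w, g in zip(workers, grads):
                w -= lr * g + lam * (w - leader)      # gradient force + pull toward leader

        workers = [torch.randn(5) for _ in range(4)]  # toy "models" with 5 parameters each
        losses = [1.2, 0.7, 0.9, 1.5]                 # worker 1 is the current leader
        grads = [torch.randn(5) for _ in range(4)]
        lsgd_step(workers, losses, grads)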
    Overcoming Catastrophic Forgetting via Direction-Constrained Optimization. (arXiv:2011.12581v2 [cs.LG] UPDATED)
    This paper studies a new design of the optimization algorithm for training deep learning models with a fixed architecture of the classification network in a continual learning framework. The training data is non-stationary and the non-stationarity is imposed by a sequence of distinct tasks. We first analyze a deep model trained on only one learning task in isolation and identify a region in network parameter space, where the model performance is close to the recovered optimum. We provide empirical evidence that this region resembles a cone that expands along the convergence direction. We study the principal directions of the trajectory of the optimizer after convergence and show that traveling along a few top principal directions can quickly bring the parameters outside the cone but this is not the case for the remaining directions. We argue that catastrophic forgetting in a continual learning setting can be alleviated when the parameters are constrained to stay within the intersection of the plausible cones of individual tasks that were so far encountered during training. Based on this observation we present our direction-constrained optimization (DCO) method, where for each task we introduce a linear autoencoder to approximate its corresponding top forbidden principal directions. They are then incorporated into the loss function in the form of a regularization term for the purpose of learning the coming tasks without forgetting. Furthermore, in order to control the memory growth as the number of tasks increases, we propose a memory-efficient version of our algorithm called compressed DCO (DCO-COMP) that allocates a memory of fixed size for storing all autoencoders. We empirically demonstrate that our algorithm performs favorably compared to other state-of-the-art regularization-based continual learning methods.
    Evaluating Contextual Embeddings and their Extraction Layers for Depression Assessment. (arXiv:2112.13795v2 [cs.CL] UPDATED)
    Recent works have demonstrated the ability to assess aspects of mental health from personal discourse. At the same time, pre-trained contextual word embedding models have grown to dominate much of NLP, but little is known empirically about how best to apply them for mental health assessment. Using degree of depression as a case study, we conduct an empirical analysis of which off-the-shelf language model, individual layers, and combinations of layers seem most promising when applied to human-level NLP tasks. Notably, we find RoBERTa most effective and, although past work standardly suggests the second-to-last layer or a concatenation of the last 4 layers, we find layer 19 (sixth-to-last) is at least as good as layer 23 when using a single layer. Further, when using multiple layers, distributing them across the second half of the 24 layers (i.e., layers 12+), rather than the last 4, yielded the most accurate results.
    An Adversarial Attack Analysis on Malicious Advertisement URL Detection Framework. (arXiv:2204.13172v1 [cs.LG])
    Malicious advertisement URLs pose a security risk since they are a source of cyber-attacks, and the need to address this issue is growing in both industry and academia. Generally, the attacker delivers an attack vector to the user by means of an email, an advertisement link, or any other means of communication, and directs them to a malicious website to steal sensitive information and to defraud them. Existing malicious URL detection techniques are limited in their ability to handle unseen features and to generalize to test data. In this study, we extract a novel set of lexical and web-scraped features and employ machine learning techniques to set up a system for detecting fraudulent advertisement URLs. The combined set of six different kinds of features precisely overcomes the obfuscation in fraudulent URL classification. Based on different statistical properties, we use twelve differently formatted datasets for the detection, prediction, and classification tasks. We extend our prediction analysis to mismatched and unlabelled datasets. For this framework, we analyze the performance of four machine learning techniques in the detection part: Random Forest, Gradient Boost, XGBoost, and AdaBoost. With our proposed method, we can achieve a false negative rate as low as 0.0037 while maintaining a high accuracy of 99.63%. Moreover, we devise a novel unsupervised technique for data clustering using the K-Means algorithm for visual analysis. This paper analyses the vulnerability of decision tree-based models using a limited-knowledge attack scenario, in which we consider an exploratory attack and implement a Zeroth Order Optimization adversarial attack on the detection models.
    Medical Image Segmentation with 3D Convolutional Neural Networks: A Survey. (arXiv:2108.08467v3 [eess.IV] UPDATED)
    Computer-aided medical image analysis plays a significant role in assisting medical practitioners for expert clinical diagnosis and deciding the optimal treatment plan. At present, convolutional neural networks (CNN) are the preferred choice for medical image analysis. In addition, with the rapid advancements in three-dimensional (3D) imaging systems and the availability of excellent hardware and software support to process large volumes of data, 3D deep learning methods are gaining popularity in medical image analysis. Here, we present an extensive review of the recently evolved 3D deep learning methods in medical image segmentation. Furthermore, the research gaps and future directions in 3D medical image segmentation are discussed.
    Improving the Robustness of Federated Learning for Severely Imbalanced Datasets. (arXiv:2204.13414v1 [cs.LG])
    With the ever-increasing data deluge and the success of deep neural networks, research on distributed deep learning has become pronounced. Two common approaches to achieving this distributed learning are synchronous and asynchronous weight updates. In this manuscript, we explore very simplistic synchronous weight update mechanisms. It has been seen that with an increasing number of worker nodes, the performance degrades drastically. This effect has been studied in the context of extremely imbalanced classification (e.g. outlier detection). In practical cases, the assumed i.i.d. conditions may not be fulfilled. There may also arise global class imbalance situations like those in outlier detection, where the local servers receive severely imbalanced data and may not get any samples from the minority class. In that case, the DNNs in the local servers become completely biased towards the majority class that they receive. This highly impacts learning at the parameter server (which practically does not see any data). It has been observed that in a parallel setting, if one uses the existing federated weight update mechanisms at the parameter server, the performance degrades drastically with an increasing number of worker nodes. This is mainly because, with an increasing number of nodes, there is a high chance that one worker node gets a very small portion of the data, either not enough to train the model without overfitting or with a highly imbalanced class distribution. This manuscript hence proposes a workaround to this problem by introducing the concept of adaptive cost-sensitive momentum averaging. It is seen that the proposed system suffers little to no degradation in performance, while most of the other methods hit their bottom performance at smaller numbers of worker nodes.
    Policy Gradient Approach to Compilation of Variational Quantum Circuits. (arXiv:2111.10227v2 [quant-ph] UPDATED)
    We propose a method for finding approximate compilations of quantum unitary transformations, based on techniques from policy gradient reinforcement learning. The choice of a stochastic policy allows us to rephrase the optimization problem in terms of probability distributions, rather than variational gates. In this framework, finding the optimal configuration is done by optimizing over distribution parameters, rather than over free angles. We show numerically that this approach can be more competitive than gradient-free methods, for comparable amounts of resources (i.e. quantum circuit runs). Another interesting feature of this approach to variational compilation is that it does not need a separate register and long-range interactions to estimate the end-point fidelity, which is an improvement over methods which rely on the Hilbert-Schmidt test. We expect these techniques to be relevant for training variational circuits in other contexts.
    Vitruvion: A Generative Model of Parametric CAD Sketches. (arXiv:2109.14124v2 [cs.LG] UPDATED)
    Parametric computer-aided design (CAD) tools are the predominant way that engineers specify physical structures, from bicycle pedals to airplanes to printed circuit boards. The key characteristic of parametric CAD is that design intent is encoded not only via geometric primitives, but also by parameterized constraints between the elements. This relational specification can be viewed as the construction of a constraint program, allowing edits to coherently propagate to other parts of the design. Machine learning offers the intriguing possibility of accelerating the design process via generative modeling of these structures, enabling new tools such as autocompletion, constraint inference, and conditional synthesis. In this work, we present such an approach to generative modeling of parametric CAD sketches, which constitute the basic computational building blocks of modern mechanical design. Our model, trained on real-world designs from the SketchGraphs dataset, autoregressively synthesizes sketches as sequences of primitives, with initial coordinates, and constraints that reference back to the sampled primitives. As samples from the model match the constraint graph representation used in standard CAD software, they may be directly imported, solved, and edited according to downstream design tasks. In addition, we condition the model on various contexts, including partial sketches (primers) and images of hand-drawn sketches. Evaluation of the proposed approach demonstrates its ability to synthesize realistic CAD sketches and its potential to aid the mechanical design workflow.
    Optimal Transport for Unsupervised Denoising Learning. (arXiv:2108.02574v4 [eess.IV] UPDATED)
    Recently, much progress has been made in unsupervised denoising learning. However, existing methods more or less rely on some assumptions on the signal and/or degradation model, which limits their practical performance. How to construct an optimal criterion for unsupervised denoising learning without any prior knowledge on the degradation model is still an open question. Toward answering this question, this work proposes a criterion for unsupervised denoising learning based on the optimal transport theory. This criterion has favorable properties, e.g., approximately maximal preservation of the information of the signal, whilst achieving perceptual reconstruction. Furthermore, though a relaxed unconstrained formulation is used in practical implementation, we prove that the relaxed formulation in theory has the same solution as the original constrained formulation. Experiments on synthetic and real-world data, including realistic photographic, microscopy, depth, and raw depth images, demonstrate that the proposed method even compares favorably with supervised methods, e.g., approaching the PSNR of supervised methods while having better perceptual quality. Particularly, for spatially correlated noise and realistic microscopy images, the proposed method not only achieves better perceptual quality but also has higher PSNR than supervised methods. Besides, it shows remarkable superiority in harsh practical conditions with complex noise, e.g., raw depth images. Code is available at https://github.com/wangweiSJTU/OTUR.
    Deepfake Forensics via An Adversarial Game. (arXiv:2103.13567v2 [cs.CV] UPDATED)
    With the progress in AI-based facial forgery (i.e., deepfake), people are increasingly concerned about its abuse. Although efforts have been made to train classification (also known as deepfake detection) models to recognize such forgeries, existing models suffer from poor generalization to unseen forgery technologies and high sensitivity to changes in image/video quality. In this paper, we advocate adversarial training for improving the generalization ability to both unseen facial forgeries and unseen image/video qualities. We believe training with samples that are adversarially crafted to attack the classification models improves the generalization ability considerably. Considering that AI-based face manipulation often leads to high-frequency artifacts that can be easily spotted by models yet are difficult to generalize from, we further propose a new adversarial training method that attempts to blur out these specific artifacts by introducing pixel-wise Gaussian blurring models. With adversarial training, the classification models are forced to learn more discriminative and generalizable features, and the effectiveness of our method is verified by ample empirical evidence. Our code will be made publicly available.
    Multi Type Mean Field Reinforcement Learning. (arXiv:2002.02513v6 [cs.MA] UPDATED)
    Mean field theory provides an effective way of scaling multiagent reinforcement learning algorithms to environments with many agents that can be abstracted by a virtual mean agent. In this paper, we extend mean field multiagent algorithms to multiple types. The types enable the relaxation of a core assumption in mean field reinforcement learning, namely that all agents in the environment play similar strategies and have the same goal. We conduct experiments on three different testbeds for the field of many-agent reinforcement learning, based on the standard MAgent framework. We consider two different kinds of mean field environments: a) games where agents belong to predefined types that are known a priori, and b) games where the type of each agent is unknown and therefore must be learned based on observations. We introduce new algorithms for each type of game and demonstrate their superior performance over state-of-the-art algorithms that assume all agents belong to the same type, as well as over other baseline algorithms in the MAgent framework.
    KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients. (arXiv:2204.13683v1 [cs.RO])
    Simulators offer the possibility of safe, low-cost development of self-driving systems. However, current driving simulators exhibit na\"ive behavior models for background traffic. Hand-tuned scenarios are typically added during simulation to induce safety-critical situations. An alternative approach is to adversarially perturb the background traffic trajectories. In this paper, we study this approach to safety-critical driving scenario generation using the CARLA simulator. We use a kinematic bicycle model as a proxy to the simulator's true dynamics and observe that gradients through this proxy model are sufficient for optimizing the background traffic trajectories. Based on this finding, we propose KING, which generates safety-critical driving scenarios with a 20% higher success rate than black-box optimization. By solving the scenarios generated by KING using a privileged rule-based expert algorithm, we obtain training data for an imitation learning policy. After fine-tuning on this new data, we show that the policy becomes better at avoiding collisions. Importantly, our generated data leads to reduced collisions on both held-out scenarios generated via KING as well as traditional hand-crafted scenarios, demonstrating improved robustness.
    Fuzzy Cognitive Maps and Hidden Markov Models: Comparative Analysis of Efficiency within the Confines of the Time Series Classification Task. (arXiv:2204.13455v1 [cs.LG])
    Time series classification is one of the most popular machine learning tasks. In this paper, we explore the application of Hidden Markov Models (HMMs) to time series classification. We distinguish between two modes of HMM application: in the first, a single model is built for each class; in the second, one HMM is built for each time series. We then transfer both approaches for classifier construction to the domain of Fuzzy Cognitive Maps. The identified four models, HMM NN (HMM, one per series), HMM 1C (HMM, one per class), FCM NN, and FCM 1C, are then studied in a series of experiments. We compare the performance of the different models and investigate the impact of their hyperparameters on time series classification accuracy. The empirical evaluation shows a clear advantage of the one-model-per-series approach. The results show that the choice between HMM and FCM should be dataset-dependent.
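    A compact sketch of the one-model-per-class mode (HMM 1C in the paper's naming) using hmmlearn; the number of hidden states and the Gaussian emission model are illustrative choices, and the same likelihood-based labeling applies to the one-model-per-series mode:

        import numpy as np
        from hmmlearn.hmm import GaussianHMM

        def fit_per_class(train_series, train_labels, n_states=3):
            models = {}
            for c in set(train_labels):
                series = [s for s, y in zip(train_series, train_labels) if y == c]
                X = np.concatenate(series)        # stack the (T_i, d) series of class c
                lengths = [len(s) for s in series]
                models[c] = GaussianHMM(n_components=n_states).fit(X, lengths)
            return models

        def classify(models, series):
            # label = class whose HMM assigns the highest log-likelihood
            return max(models, key=lambda c: models[c].score(series))

        train = [np.random.randn(50, 1) for _ in range(6)]  # toy univariate series
        labels = [0, 0, 0, 1, 1, 1]
        models = fit_per_class(train, labels)
        print(classify(models, np.random.randn(50, 1)))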
    Autoencoder based Hybrid Multi-Task Predictor Network for Daily Open-High-Low-Close Prices Prediction of Indian Stocks. (arXiv:2204.13422v1 [cs.LG])
    Stock prices are highly volatile, and sudden changes in trend are often very problematic for traditional forecasting models to handle. Standard Long Short Term Memory (LSTM) networks are regarded as the state-of-the-art models for such predictions, but these models fail to handle sudden and drastic changes in the price trend. Moreover, there are some inherent constraints among the open, high, low, and close (OHLC) prices of stocks, and the literature lacks a study of these inherent properties of OHLC prices. We argue that predicting the OHLC prices for the next day is much more informative than predicting the trend of a stock, as the trend is mostly calculated from these OHLC prices anyway. The problem is mainly focused on Buy-Today Sell-Tomorrow (BTST) trading. In this regard, autoencoders (AEs), when pre-trained with the stock prices, may be beneficial. A novel framework is proposed in which a pre-trained encoder is cascaded in front of a multi-task predictor network. This hybrid network can leverage the power of a combination of networks: it can both handle the OHLC constraints and capture any sudden drastic changes in the prices. It is seen that such a network is much more efficient at predicting stock prices. The experiments have been extended to recommend the most profitable and most overbought stocks for the next day. The model has been tested for multiple Indian companies, and it is found that the recommendations from the proposed model have not resulted in a single loss over a test period of 300 days.
    Cumulative Stay-time Representation for Electronic Health Records in Medical Event Time Prediction. (arXiv:2204.13451v1 [cs.LG])
    We address the problem of predicting when a disease will develop, i.e., medical event time (MET), from a patient's electronic health record (EHR). The MET of non-communicable diseases like diabetes is highly correlated to cumulative health conditions, more specifically, how much time the patient spent with specific health conditions in the past. The common time-series representation is indirect in extracting such information from EHR because it focuses on detailed dependencies between values in successive observations, not cumulative information. We propose a novel data representation for EHR called cumulative stay-time representation (CTR), which directly models such cumulative health conditions. We derive a trainable construction of CTR based on neural networks that has the flexibility to fit the target data and scalability to handle high-dimensional EHR. Numerical experiments using synthetic and real-world datasets demonstrate that CTR alone achieves a high prediction performance, and it enhances the performance of existing models when combined with them.
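    To make the idea concrete, here is a hedged toy computation of a cumulative stay-time vector from an irregular series; the bin edges, the hold-until-next-visit assumption, and the variable names are illustrative, not the paper's trainable construction:

        import numpy as np

        times = np.array([0.0, 30.0, 90.0, 180.0])   # observation times (days)
        values = np.array([5.2, 6.1, 7.3, 6.8])      # e.g., HbA1c at each visit
        bins = np.array([0.0, 5.7, 6.5, np.inf])     # normal / prediabetic / diabetic

        def cumulative_stay_time(times, values, bins):
            stay = np.zeros(len(bins) - 1)
            durations = np.diff(times)               # hold each value until the next visit
            idx = np.digitize(values[:-1], bins) - 1
            np.add.at(stay, idx, durations)          # accumulate time spent per bin
            return stay

        print(cumulative_stay_time(times, values, bins))  # days spent per condition bin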
    Reusability and Transferability of Macro Actions for Reinforcement Learning. (arXiv:1908.01478v3 [cs.NE] UPDATED)
    Conventional reinforcement learning (RL) typically determines an appropriate primitive action at each timestep. However, by using a proper macro action, defined as a sequence of primitive actions, an agent is able to bypass intermediate states and reach a farther state, facilitating its learning procedure. The problem we investigate is what beneficial properties macro actions may possess. In this paper, we unveil the properties of reusability and transferability of macro actions. The first property, reusability, means that a macro action generated along with one RL method can be reused by another RL method for training; the second, transferability, means that a macro action can be utilized for training agents in similar environments with different reward settings. In our experiments, we first generate macro actions along with RL methods. We then provide a set of analyses to reveal the properties of reusability and transferability of the generated macro actions.
    A unified theory of information transfer and causal relation. (arXiv:2204.13598v1 [cond-mat.stat-mech])
    Information transfer between coupled stochastic dynamics, measured by transfer entropy and information flow, is suggested as a physical process underlying the causal relation of systems. While information transfer analysis has found booming applications in both science and engineering fields, critical mysteries about its foundations remain unsolved. Fundamental yet difficult questions concern how information transfer and causal relation originate, what they depend on, how they differ from each other, and whether they are created by a unified and general quantity. These questions essentially determine the validity of causal relation measurement via information transfer. Here we aim to lay a complete theoretical basis of information transfer and causal relation. Beyond the well-known relations between these concepts that hold only conditionally, we demonstrate that information transfer and causal relation universally originate from specific information synergy and redundancy phenomena characterized by high-order mutual information. More importantly, our theory analytically explains the mechanisms by which information transfer and causal relation originate, vanish, and differ from each other. Moreover, our theory naturally defines the effect sizes of information transfer and causal relation based on high-dimensional coupling events. These results may provide a unified view of information, synergy, and causal relation, bridging Pearl's causal inference theory in computer science and information transfer analysis in physics.
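    For reference, the transfer entropy that this line of analysis builds on is standardly defined (following Schreiber) as

        T_{X \to Y} \;=\; \sum p\!\left(y_{t+1},\, y_t^{(k)},\, x_t^{(l)}\right)
        \log \frac{p\!\left(y_{t+1} \mid y_t^{(k)},\, x_t^{(l)}\right)}
                  {p\!\left(y_{t+1} \mid y_t^{(k)}\right)},

    where $y_t^{(k)}$ and $x_t^{(l)}$ denote length-$k$ and length-$l$ histories of the target and source processes; the abstract's claim is that this quantity, and causal relation more generally, can be traced back to synergy and redundancy terms of high-order mutual information.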
    Unaligned Supervision For Automatic Music Transcription in The Wild. (arXiv:2204.13668v1 [cs.SD])
    Multi-instrument Automatic Music Transcription (AMT), or the decoding of a musical recording into semantic musical content, is one of the holy grails of Music Information Retrieval. Current AMT approaches are restricted to piano and (some) guitar recordings, due to difficult data collection. In order to overcome data collection barriers, previous AMT approaches attempt to employ musical scores in the form of a digitized version of the same song or piece. The scores are typically aligned using audio features and strenuous human intervention to generate training labels. We introduce NoteEM, a method for simultaneously training a transcriber and aligning the scores to their corresponding performances, in a fully automated process. Using this unaligned supervision scheme, complemented by pseudo-labels and pitch-shift augmentation, our method enables training on in-the-wild recordings with unprecedented accuracy and instrumental variety. Using only synthetic data and unaligned supervision, we report SOTA note-level accuracy on the MAPS dataset, and large favorable margins in cross-dataset evaluations. We also demonstrate robustness and ease of use; we report comparable results when training on a small, easily obtainable, self-collected dataset, and we propose alternative labeling for the MusicNet dataset, which we show to be more accurate. Our project page is available at https://benadar293.github.io
    Process-BERT: A Framework for Representation Learning on Educational Process Data. (arXiv:2204.13607v1 [cs.LG])
    Educational process data, i.e., logs of detailed student activities in computerized or online learning platforms, has the potential to offer deep insights into how students learn. One can use process data for many downstream tasks such as learning outcome prediction and automatically delivering personalized intervention. However, analyzing process data is challenging since the specific format of process data varies a lot depending on different learning/testing scenarios. In this paper, we propose a framework for learning representations of educational process data that is applicable across many different learning scenarios. Our framework consists of a pre-training step that uses BERT-type objectives to learn representations from sequential process data and a fine-tuning step that further adjusts these representations on downstream prediction tasks. We apply our framework to the 2019 nation's report card data mining competition dataset that consists of student problem-solving process data and detail the specific models we use in this scenario. We conduct both quantitative and qualitative experiments to show that our framework results in process data representations that are both predictive and informative.
    Unlocking High-Accuracy Differentially Private Image Classification through Scale. (arXiv:2204.13650v1 [cs.LG])
    Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method, realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA on CIFAR-10 of 81.4% under (8, 10^{-5})-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained 200-layer Normalizer-Free ResNet, we achieve a remarkable 77.1% top-1 accuracy on ImageNet under (1, 8*10^{-7})-DP, and achieve 81.1% under (8, 8*10^{-7})-DP. This markedly exceeds the previous SOTA of 47.9% under a larger privacy budget of (10, 10^{-6})-DP. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
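    For context, the DP-SGD mechanism the abstract refers to clips each per-sample gradient to norm C and adds Gaussian noise calibrated to C before averaging; a minimal, non-vectorized sketch (real implementations vectorize the per-sample gradients):

        import torch

        def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip=1.0, sigma=1.0):
            summed = [torch.zeros_like(p) for p in model.parameters()]
            for x, y in zip(xs, ys):                  # per-sample gradients
                model.zero_grad()
                loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
                norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
                scale = (clip / (norm + 1e-12)).clamp(max=1.0)  # clip to norm <= C
                for s, p in zip(summed, model.parameters()):
                    s += p.grad * scale
            with torch.no_grad():
                for s, p in zip(summed, model.parameters()):
                    noise = sigma * clip * torch.randn_like(s)  # noise scale tied to C
                    p -= lr * (s + noise) / len(xs)             # noisy averaged update

        model = torch.nn.Linear(4, 2)
        xs, ys = torch.randn(8, 4), torch.randint(0, 2, (8,))
        dp_sgd_step(model, torch.nn.CrossEntropyLoss(), xs, ys)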
    Bona fide Riesz projections for density estimation. (arXiv:2204.13606v1 [eess.SP])
    The projection of sample measurements onto a reconstruction space represented by a basis on a regular grid is a powerful and simple approach to estimate a probability density function. In this paper, we focus on Riesz bases and propose a projection operator that, in contrast to previous works, guarantees the bona fide properties for the estimate, namely, non-negativity and total probability mass $1$. Our bona fide projection is defined as a convex problem. We propose solution techniques and evaluate them. Results suggest an improved performance, specifically in circumstances prone to rippling effects.
    Foundations for learning from noisy quantum experiments. (arXiv:2204.13691v1 [quant-ph])
    Understanding what can be learned from experiments is central to scientific progress. In this work, we use a learning-theoretic perspective to study the task of learning physical operations in a quantum machine when all operations (state preparation, dynamics, and measurement) are a priori unknown. We prove that, without any prior knowledge, if one can explore the full quantum state space by composing the operations, then every operation can be learned. When one cannot explore the full state space but all operations are approximately known and noise in Clifford gates is gate-independent, we find an efficient algorithm for learning all operations up to a single unlearnable parameter characterizing the fidelity of the initial state. For learning a noise channel on Clifford gates to a fixed accuracy, our algorithm uses quadratically fewer experiments than previously known protocols. Under more general conditions, the true description of the noise can be unlearnable; for example, we prove that no benchmarking protocol can learn gate-dependent Pauli noise on Clifford+T gates even under perfect state preparation and measurement. Despite not being able to learn the noise, we show that a noisy quantum computer that performs entangled measurements on multiple copies of an unknown state can yield a large advantage in learning properties of the state compared to a noiseless device that measures individual copies and then processes the measurement data using a classical computer. Concretely, we prove that noisy quantum computers with two-qubit gate error rate $\epsilon$ can achieve a learning task using $N$ copies of the state, while $N^{\Omega(1/\epsilon)}$ copies are required classically.
    Phase Shift Design in RIS Empowered Wireless Networks: From Optimization to AI-Based Methods. (arXiv:2204.13372v1 [cs.LG])
    Reconfigurable intelligent surfaces (RISs) have a revolutionary capability to customize the radio propagation environment for wireless networks. To fully exploit the advantages of RISs in wireless systems, the phases of the reflecting elements must be jointly designed with conventional communication resources, such as beamformers, transmit power, and computation time. However, due to the unique constraints on the phase shift, and massive numbers of reflecting units and users in large-scale networks, the resulting optimization problems are challenging to solve. This paper provides a review of current optimization methods and artificial intelligence-based methods for handling the constraints imposed by RIS and compares them in terms of solution quality and computational complexity. Future challenges in phase shift optimization involving RISs are also described and potential solutions are discussed.
    It's DONE: Direct ONE-shot learning without training optimization. (arXiv:2204.13361v1 [cs.LG])
    Learning a new concept from one example is a remarkable capability of the human brain, and it has drawn attention in the field of machine learning as the one-shot learning task. In this paper, we propose the simplest method for this task, named Direct ONE-shot learning (DONE). DONE adds a new class to a pretrained deep neural network (DNN) classifier with neither training optimization nor modification of the other classes. DONE is inspired by Hebbian theory and directly uses the input to the final dense layer (the neural activity) obtained from a single example of the new class as the connectivity weights (synaptic strengths) of a newly provided output neuron for that class. DONE requires just one inference to obtain the input to the final dense layer, and its procedure is simple and deterministic, requiring no parameter tuning or hyperparameters. The performance of DONE depends entirely on the pretrained DNN model used as a backbone model, and we confirmed that DONE with a well-trained backbone model achieves practical-level accuracy. DONE has several advantages, including practical use of DNNs where training costs are prohibitive, evaluation of existing DNN models, and understanding of the brain. DONE also suggests that one-shot learning is an easy task achievable by a simple principle, not only for humans but also for current well-trained DNN models.
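    A minimal sketch of the described procedure; the backbone/head split, the 784-dimensional input, and the omission of any normalization of the activation vector are illustrative assumptions:

        import torch
        import torch.nn as nn

        feat_dim, n_old = 512, 10
        backbone = nn.Sequential(nn.Linear(784, feat_dim), nn.ReLU())  # stand-in feature extractor
        head = nn.Linear(feat_dim, n_old, bias=False)                  # final dense layer

        def add_class(head, backbone, x_new):
            with torch.no_grad():
                a = backbone(x_new.unsqueeze(0)).squeeze(0)  # one inference only
                # New class weights = the final-layer input itself (DONE may also
                # normalize this vector; omitted here as a simplifying assumption).
                w = torch.cat([head.weight, a.unsqueeze(0)], dim=0)
            new_head = nn.Linear(head.in_features, head.out_features + 1, bias=False)
            new_head.weight = nn.Parameter(w)
            return new_head

        head = add_class(head, backbone, torch.randn(784))
        print(head.out_features)  # 11: new class added without any training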
    List-Mode PET Image Reconstruction Using Deep Image Prior. (arXiv:2204.13404v1 [physics.med-ph])
    List-mode positron emission tomography (PET) image reconstruction is an important tool for PET scanners with many lines-of-response (LORs) and additional information such as time-of-flight and depth-of-interaction. Deep learning is one possible solution for enhancing the quality of PET image reconstruction. However, the application of deep learning techniques to list-mode PET image reconstruction has not progressed, because list data is a sequence of bit codes unsuitable for processing by convolutional neural networks (CNNs). In this study, we propose a novel list-mode PET image reconstruction method using an unsupervised CNN called deep image prior (DIP) and a framework based on the alternating direction method of multipliers. The proposed list-mode DIP reconstruction (LM-DIPRecon) method alternately iterates the regularized list-mode dynamic row action maximum likelihood algorithm (LM-DRAMA) and the magnetic resonance imaging conditioned DIP (MR-DIP). We evaluated LM-DIPRecon using both simulation and clinical data, and it achieved sharper images and better tradeoff curves between contrast and noise than LM-DRAMA and MR-DIP. These results indicate that LM-DIPRecon is useful for quantitative PET imaging with limited events. In addition, as list data has finer temporal information than dynamic sinograms, list-mode deep image prior reconstruction is expected to be useful for 4D PET imaging and motion correction.
    Poisoning Deep Learning based Recommender Model in Federated Learning Scenarios. (arXiv:2204.13594v1 [cs.IR])
    Various attack methods against recommender systems have been proposed in the past years, and the security issues of recommender systems have drawn considerable attention. Traditional attacks attempt to make target items recommended to as many users as possible by poisoning the training data. Benefiting from the protection of users' private data, federated recommendation can effectively defend against such attacks. Therefore, quite a few works have devoted themselves to developing federated recommender systems. To show that current federated recommendation is still vulnerable, in this work we design attack approaches targeting deep learning based recommender models in federated learning scenarios. Specifically, our attacks generate poisoned gradients for manipulated malicious users to upload, based on two strategies (i.e., random approximation and hard user mining). Extensive experiments show that our well-designed attacks can effectively poison the target models, and their attack effectiveness sets the state of the art.
    Model Selection, Adaptation, and Combination for Deep Transfer Learning through Neural Networks in Renewable Energies. (arXiv:2204.13293v1 [cs.LG])
    There is recent interest in using model hubs, collections of pre-trained models, in computer vision tasks. To utilize a model hub, we first select a source model and then adapt the model for the target to compensate for differences. While research on model selection and adaptation for computer vision tasks is still limited, this holds even more for the field of renewable power. At the same time, meeting the increasing demand for power forecasts based on weather features from numerical weather predictions is a crucial challenge. We close these gaps by conducting the first thorough experiment on model selection and adaptation for transfer learning in renewable power forecasting, adopting recent results from the field of computer vision on six datasets. We adopt models based on data from different seasons and limit the amount of training data. As an extension of the current state of the art, we utilize a Bayesian linear regression for forecasting the response based on features extracted from a neural network. This approach outperforms the baseline with only seven days of training data. We further show how combining multiple models through ensembles can significantly improve the model selection and adaptation approach. In fact, with more than 30 days of training data, both proposed model combination techniques achieve results similar to those of models trained with a full year of training data.
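    A hedged sketch of the described two-stage approach: extract features with a (pre-trained) network, then fit a Bayesian linear regression on them. BayesianRidge is scikit-learn's standard choice for this; the feature extractor, feature dimensions, and data here are stand-ins:

        import numpy as np
        import torch
        import torch.nn as nn
        from sklearn.linear_model import BayesianRidge

        extractor = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 16))

        weather = torch.randn(200, 12)  # NWP weather features (illustrative)
        power = np.random.rand(200)     # observed power output (illustrative)

        with torch.no_grad():
            feats = extractor(weather).numpy()  # penultimate-layer features

        reg = BayesianRidge().fit(feats[:150], power[:150])
        mean, std = reg.predict(feats[150:], return_std=True)  # predictive uncertainty too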
    On Parametric Optimal Execution and Machine Learning Surrogates. (arXiv:2204.08581v2 [q-fin.TR] UPDATED)
    We investigate optimal order execution problems in discrete time with instantaneous price impact and stochastic resilience. First, in the setting of linear transient price impact we derive a closed-form recursion for the optimal strategy, extending the deterministic results from Obizhaeva and Wang (J Financial Markets, 2013). Second, we develop a numerical algorithm based on dynamic programming and deep learning for the case of nonlinear transient price impact as proposed by Bouchaud et al. (Quant. Finance, 2004). Specifically, we utilize an actor-critic framework that constructs two neural-network (NN) surrogates for the value function and the feedback control. The flexible scalability of NN functional approximators enables parametric learning, i.e., incorporating several model or market parameters as part of the input space. Precise calibration of price impact, resilience, etc., is known to be extremely challenging and hence it is critical to understand sensitivity of the execution policy to these parameters. Our NN learner organically scales across multiple input dimensions and is shown to accurately approximate optimal strategies across a wide range of parameter configurations. We provide a fully reproducible Jupyter Notebook with our NN implementation, which is of independent pedagogical interest, demonstrating the ease of use of NN surrogates in (parametric) stochastic control problems.
    Predicting Sleeping Quality using Convolutional Neural Networks. (arXiv:2204.13584v1 [eess.SP])
    Identifying sleep stages and patterns is an essential part of diagnosing and treating sleep disorders. With the advancement of smart technologies, sensor data related to sleeping patterns can be captured easily. In this paper, we propose a Convolutional Neural Network (CNN) architecture that improves classification performance. In particular, we benchmark the classification performance of different methods, including traditional machine learning methods such as Logistic Regression (LR), Decision Trees (DT), k-Nearest Neighbour (k-NN), Naive Bayes (NB), and Support Vector Machine (SVM), on 3 publicly available sleep datasets. The accuracy, sensitivity, specificity, precision, recall, and F-score are reported and will serve as a baseline to stimulate future research in this direction.
    Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss. (arXiv:2204.13437v1 [cs.SD])
    Recent deep learning Text-to-Speech (TTS) systems have achieved impressive performance by generating speech close to human parity. However, they suffer from training stability issues as well as incorrect alignment of the intermediate acoustic representation with the input text sequence. In this work, we introduce Regotron, a regularized version of Tacotron2 which aims to alleviate the training issues and at the same time produce monotonic alignments. Our method augments the vanilla Tacotron2 objective function with an additional term, which penalizes non-monotonic alignments in the location-sensitive attention mechanism. By properly adjusting this regularization term we show that the loss curves become smoother, and at the same time Regotron consistently produces monotonic alignments in unseen examples even at an early stage (13\% of the total number of epochs) of its training process, whereas the fully converged Tacotron2 fails to do so. Moreover, our proposed regularization method has no additional computational overhead, while reducing common TTS mistakes and achieving slightly improved speech naturalness according to subjective mean opinion scores (MOS) collected from 50 evaluators.
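    One plausible form of such a penalty, sketched under the assumption that monotonicity is enforced on the expected attended position (the paper's exact term and weighting may differ):

        import torch

        def monotonic_alignment_loss(attn):
            # attn: (decoder_steps, encoder_steps), rows sum to 1
            pos = torch.arange(attn.shape[1], dtype=attn.dtype)
            expected = attn @ pos  # expected input position at each decoder step
            # penalize any decrease in the expected position between steps
            return torch.relu(expected[:-1] - expected[1:]).mean()

        attn = torch.softmax(torch.randn(20, 50), dim=1)     # stand-in attention weights
        reg = monotonic_alignment_loss(attn)                 # added to the vanilla loss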
    Multi-Player Multi-Armed Bandits with Finite Shareable Resources Arms: Learning Algorithms & Applications. (arXiv:2204.13502v1 [cs.LG])
    Multi-player multi-armed bandits (MMAB) study how decentralized players cooperatively play the same multi-armed bandit so as to maximize their total cumulative rewards. Existing MMAB models mostly assume when more than one player pulls the same arm, they either have a collision and obtain zero rewards, or have no collision and gain independent rewards, both of which are usually too restrictive in practical scenarios. In this paper, we propose an MMAB with shareable resources as an extension to the collision and non-collision settings. Each shareable arm has finite shareable resources and a "per-load" reward random variable, both of which are unknown to players. The reward from a shareable arm is equal to the "per-load" reward multiplied by the minimum between the number of players pulling the arm and the arm's maximal shareable resources. We consider two types of feedback: sharing demand information (SDI) and sharing demand awareness (SDA), each of which provides different signals of resource sharing. We design the DPE-SDI and SIC-SDA algorithms to address the shareable arm problem under these two cases of feedback respectively and prove that both algorithms have logarithmic regrets that are tight in the number of rounds. We conduct simulations to validate both algorithms' performance and show their utilities in wireless networking and edge computing.
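    The shareable-arm reward model is easy to state in code; a toy sketch with made-up per-load reward means and capacities:

        import numpy as np

        rng = np.random.default_rng(0)
        capacity = np.array([1, 2, 3])            # maximal shareable resources per arm
        mean_per_load = np.array([0.9, 0.6, 0.4]) # unknown to the players

        def pull(arm_counts):
            # arm_counts[k] = number of players pulling arm k this round
            per_load = rng.normal(mean_per_load, 0.1)           # "per-load" reward draw
            return per_load * np.minimum(arm_counts, capacity)  # saturates at capacity

        print(pull(np.array([3, 3, 3])))  # arm 0 saturates at 1 load, arm 2 serves all 3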
    Predicting batch queue job wait times for informed scheduling of urgent HPC workloads. (arXiv:2204.13543v1 [cs.DC])
    There is increasing interest in the use of HPC machines for urgent workloads to help tackle disasters as they unfold. Whilst batch queue systems are not ideal in supporting such workloads, many disadvantages can be worked around by accurately predicting when a waiting job will start to run. However, there are numerous challenges in achieving such a prediction with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behaviour resulting from the queue policy and other interactions to generate accurate job start times. For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600) and 4-cabinet (HPE Cray EX) we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimation generated by Slurm. We demonstrate that our techniques deliver the most accurate predictions across our machines of interest, with the result of this work being the ability to predict job start times within one minute of the actual start time for around 65\% of jobs on ARCHER2 and 4-cabinet, and 76\% of jobs on Cirrus. When compared against what Slurm can deliver, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better for Cirrus. Furthermore, our approach can accurately predict the start time for three quarters of all jobs within ten minutes of the actual start time on ARCHER2 and 4-cabinet, and for 90\% of jobs on Cirrus. Whilst the driver of this work has been to better facilitate placement of urgent workloads across HPC machines, the insights gained can provide wider benefits to users, enrich existing batch queue systems, and inform policy too.
    BI-GreenNet: Learning Green's functions by boundary integral network. (arXiv:2204.13247v1 [cs.LG])
    Green's function plays a significant role in both theoretical analysis and numerical computing of partial differential equations (PDEs). However, in most cases, Green's function is difficult to compute. The difficulties are threefold. Firstly, compared with the original PDE, the dimension of Green's function is doubled, making it intractable for traditional mesh-based methods. Secondly, Green's function usually contains singularities, which increase the difficulty of obtaining a good approximation. Lastly, the computational domain may be very complex or even unbounded. To overcome these problems, we leverage the fundamental solution, the boundary integral method, and neural networks to develop a new method for computing Green's function with high accuracy in this paper. We focus on Green's functions of the Poisson and Helmholtz equations in bounded and unbounded domains, and also consider the Poisson and Helmholtz equations in domains with interfaces. Extensive numerical experiments illustrate the efficiency and accuracy of our method for computing Green's function. In addition, we use the Green's function calculated by our method to solve a class of PDEs, also obtaining high-precision solutions, which shows the good generalization ability of our method for solving PDEs.
    Deep graph matching meets mixed-integer linear programming: Relax at your own risk ?. (arXiv:2108.00394v5 [cs.CV] UPDATED)
    Graph matching is an important problem that has received widespread attention, especially in the field of computer vision. Recently, state-of-the-art methods have sought to incorporate graph matching with deep learning. However, there is no research explaining what role the graph matching algorithm plays in the model. Therefore, we propose an approach integrating a mixed-integer linear programming (MILP) formulation of the graph matching problem. This formulation is solved to optimality and provides an inherent baseline. Meanwhile, similar approaches are derived by relaxing the optimality guarantee of the graph matching solver and by introducing a quality level that controls the quality of the solutions provided by the solver. In addition, several relaxations of the graph matching problem are put to the test. Our experimental evaluation yields several theoretical insights and guides the direction of deep graph matching methods.
    Predicting single-cell perturbation responses for unseen drugs. (arXiv:2204.13545v1 [cs.LG])
    Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA-seq HTS is required to enrich single-cell data meaningfully. We introduce a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with a transfer learning scheme and demonstrate how training on existing bulk RNA-seq HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating targeted drug discovery.
    Exploring How Anomalous Model Input and Output Alerts Affect Decision-Making in Healthcare. (arXiv:2204.13194v1 [cs.HC])
    An important goal in the field of human-AI interaction is to help users more appropriately trust AI systems' decisions. A situation in which the user may particularly benefit from more appropriate trust is when the AI receives anomalous input or provides anomalous output. To the best of our knowledge, this is the first work towards understanding how anomaly alerts may contribute to appropriate trust of AI. In a formative mixed-methods study with 4 radiologists and 4 other physicians, we explore how AI alerts for anomalous input, very high and low confidence, and anomalous saliency-map explanations affect users' experience with mockups of an AI clinical decision support system (CDSS) for evaluating chest x-rays for pneumonia. We find evidence suggesting that the four anomaly alerts are desired by non-radiologists, and the high-confidence alerts are desired by both radiologists and non-radiologists. In a follow-up user study, we investigate how high- and low-confidence alerts affect the accuracy and thus appropriate trust of 33 radiologists working with AI CDSS mockups. We observe that these alerts do not improve users' accuracy or experience and discuss potential reasons why.
    Machine Learning for Violence Risk Assessment Using Dutch Clinical Notes. (arXiv:2204.13535v1 [cs.LG])
    Violence risk assessment in psychiatric institutions enables interventions to avoid violence incidents. Clinical notes written by practitioners and available in electronic health records are valuable resources capturing unique information, but are seldom used to their full potential. We explore conventional and deep machine learning methods to assess violence risk in psychiatric patients using practitioner notes. The performance of our best models is comparable to the currently used questionnaire-based method, with an area under the Receiver Operating Characteristic curve of approximately 0.8. We find that the deep-learning model BERTje performs worse than conventional machine learning methods. We also evaluate our data and our classifiers to understand the performance of our models better. This is particularly important for the applicability of evaluated classifiers to new data, and is also of great interest to practitioners, due to the increased availability of new data in electronic format.
    Learning Storm Surge with Gradient Boosting. (arXiv:2204.13168v1 [cs.CE])
    Storm surge is a major natural hazard for coastal regions, responsible both for significant property damage and loss of life. Accurate, efficient models of storm surge are needed both to assess long-term risk and to guide emergency management decisions. While high-fidelity ocean circulation models such as the ADvanced CIRCulation (ADCIRC) model can accurately predict storm surge, they are very computationally expensive. Consequently, there have been a number of efforts in recent years to develop data-driven surrogate models for storm surge. While these models can attain good accuracy and are highly efficient, they are often limited to a small geographical region and a fixed set of output locations. We develop a novel surrogate model for peak storm surge prediction based on gradient boosting. Unlike most surrogate approaches, our model is not explicitly constrained to a fixed set of output locations or specific geographical region. The model is trained with a database of 446 synthetic storms that make landfall on the Texas coast and obtains a mean absolute error of 0.25 meters. We additionally present a test of the model on Hurricanes Ike (2008) and Harvey (2017).
    FedShuffle: Recipes for Better Use of Local Work in Federated Learning. (arXiv:2204.13169v1 [cs.LG])
    The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). In this work, we propose a general recipe, FedShuffle, that better utilizes the local updates in FL, especially in the heterogeneous regime. Unlike many prior works, FedShuffle does not assume any uniformity in the number of updates per device. Our FedShuffle recipe comprises four simple-yet-powerful ingredients: 1) local shuffling of the data, 2) adjustment of the local learning rates, 3) update weighting, and 4) momentum variance reduction (Cutkosky and Orabona, 2019). We present a comprehensive theoretical analysis of FedShuffle and show that both theoretically and empirically, our approach does not suffer from the objective function mismatch that is present in FL methods which assume homogeneous updates in heterogeneous FL setups, e.g., FedAvg (McMahan et al., 2017). In addition, by combining the ingredients above, FedShuffle improves upon FedNova (Wang et al., 2020), which was previously proposed to solve this mismatch. We also show that FedShuffle with momentum variance reduction can improve upon non-local methods under a Hessian similarity assumption. Finally, through experiments on synthetic and real-world datasets, we illustrate how each of the four ingredients used in FedShuffle helps improve the use of local updates in FL.
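    To make the ingredients concrete, here is a hedged numpy sketch of one server round with heterogeneous local work: data is shuffled locally, local learning rates are scaled down for clients doing more epochs, and updates are aggregated with data-proportional weights. FedShuffle's exact scalings and weights differ, and momentum variance reduction is omitted.

        # Illustrative sketch in the spirit of FedShuffle, on a least-squares
        # objective; not the paper's exact recipe.
        import numpy as np

        def local_update(w, X, y, epochs, base_lr):
            lr = base_lr / epochs                  # local learning-rate adjustment
            w = w.copy()
            for _ in range(epochs):
                for i in np.random.permutation(len(y)):      # local shuffling
                    w -= lr * 2 * X[i] * (X[i] @ w - y[i])   # per-sample gradient
            return w

        def server_round(w, clients, base_lr=0.1):
            """clients: list of (X, y, local_epochs); heterogeneous epochs allowed."""
            n_total = sum(len(y) for _, y, _ in clients)
            delta = np.zeros_like(w)
            for X, y, epochs in clients:
                # update weighting: proportional to local data size
                delta += (len(y) / n_total) * (local_update(w, X, y, epochs, base_lr) - w)
            return w + delta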
    Learning-to-Rank at the Speed of Sampling: Plackett-Luce Gradient Estimation With Minimal Computational Complexity. (arXiv:2204.10872v2 [cs.LG] UPDATED)
    Plackett-Luce gradient estimation enables the optimization of stochastic ranking models within feasible time constraints through sampling techniques. Unfortunately, the computational complexity of existing methods does not scale well with the length of the rankings, i.e. the ranking cutoff, nor with the item collection size. In this paper, we introduce the novel PL-Rank-3 algorithm that performs unbiased gradient estimation with a computational complexity comparable to the best sorting algorithms. As a result, our novel learning-to-rank method is applicable in any scenario where standard sorting is feasible in reasonable time. Our experimental results indicate large gains in the time required for optimization, without any loss in performance. For the field, our contribution could potentially allow state-of-the-art learning-to-rank methods to be applied to much larger scales than previously feasible.
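    The sampling primitive behind such methods can be made concrete with the Gumbel trick: perturbing item log-scores with Gumbel noise and sorting yields an exact Plackett-Luce sample at the cost of a sort, which matches the "as fast as sorting" intuition. This sketch shows only the sampling step, not PL-Rank-3's gradient estimator itself.

        import numpy as np

        def sample_pl_ranking(log_scores, k=None, rng=None):
            """Exact Plackett-Luce sample via Gumbel perturbation + sort."""
            rng = rng or np.random.default_rng()
            perturbed = log_scores + rng.gumbel(size=log_scores.shape)
            order = np.argsort(-perturbed)             # O(n log n), like sorting
            return order if k is None else order[:k]   # truncate at a ranking cutoff

        print(sample_pl_ranking(np.log([0.5, 2.0, 1.0, 0.1])))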
    Unsupervised Spatial-spectral Hyperspectral Image Reconstruction and Clustering with Diffusion Geometry. (arXiv:2204.13497v1 [cs.CV])
    Hyperspectral images, which store a hundred or more spectral bands of reflectance, have become an important data source in natural and social sciences. Hyperspectral images are often generated in large quantities at a relatively coarse spatial resolution. As such, unsupervised machine learning algorithms incorporating known structure in hyperspectral imagery are needed to analyze these images automatically. This work introduces the Spatial-Spectral Image Reconstruction and Clustering with Diffusion Geometry (DSIRC) algorithm for partitioning highly mixed hyperspectral images. DSIRC reduces measurement noise through a shape-adaptive reconstruction procedure. In particular, for each pixel, DSIRC locates spectrally correlated pixels within a data-adaptive spatial neighborhood and reconstructs that pixel's spectral signature using those of its neighbors. DSIRC then locates high-density, high-purity pixels far in diffusion distance (a data-dependent distance metric) from other high-density, high-purity pixels and treats these as cluster exemplars, giving each a unique label. Non-modal pixels are assigned the label of their diffusion distance-nearest neighbor of higher density and purity that is already labeled. Strong numerical results indicate that incorporating spatial information through image reconstruction substantially improves the performance of pixel-wise clustering.
    On the Normalizing Constant of the Continuous Categorical Distribution. (arXiv:2204.13290v1 [stat.ML])
    Probability distributions supported on the simplex enjoy a wide range of applications across statistics and machine learning. Recently, a novel family of such distributions has been discovered: the continuous categorical. This family enjoys remarkable mathematical simplicity; its density function resembles that of the Dirichlet distribution, but with a normalizing constant that can be written in closed form using elementary functions only. In spite of this mathematical simplicity, our understanding of the normalizing constant remains far from complete. In this work, we characterize the numerical behavior of the normalizing constant and we present theoretical and methodological advances that can, in turn, help to enable broader applications of the continuous categorical distribution. Our code is available at https://github.com/cunningham-lab/cb_and_cc/.
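    As a concrete anchor for the discussion, a minimal sketch of the family's density in our own notation (the abstract gives it only implicitly): for $x$ in the simplex $\Delta^{K-1}$ and parameters $\lambda_k > 0$,

        \[ p(x;\lambda) \;=\; C(\lambda) \prod_{k=1}^{K} \lambda_k^{x_k}, \qquad x \in \Delta^{K-1}, \]

    where $C(\lambda)$ is the closed-form normalizing constant whose numerical behavior the paper characterizes.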
    Machine learning for knowledge acquisition and accelerated inverse-design for non-Hermitian systems. (arXiv:2204.13376v1 [physics.optics])
    Non-Hermitian systems offer new platforms for unusual physical properties that can be flexibly manipulated by redistribution of the real and imaginary parts of refractive indices, whose presence breaks conventional wave propagation symmetries, leading to asymmetric reflection and symmetric transmission with respect to the wave propagation direction. Here, we use supervised and unsupervised learning techniques for knowledge acquisition in non-Hermitian systems, which accelerates the inverse-design process. In particular, we construct a deep learning model that relates the transmission and asymmetric reflection in non-conservative settings, and propose sub-manifold learning to recognize non-Hermitian features from transmission spectra. The developed deep learning framework determines the feasibility of a desired spectral response for a given structure and uncovers the role of effective gain-loss parameters to tailor the spectral response. These findings pave the way for intelligent inverse design and shape our understanding of the physical mechanism in general non-Hermitian systems.
    MetaCVR: Conversion Rate Prediction via Meta Learning in Small-Scale Recommendation Scenarios. (arXiv:2112.13753v5 [cs.LG] UPDATED)
    Different from large-scale platforms such as Taobao and Amazon, CVR modeling in small-scale recommendation scenarios is more challenging due to the severe Data Distribution Fluctuation (DDF) issue. DDF prevents existing CVR models from being effective since 1) several months of data are needed to train CVR models sufficiently in small scenarios, leading to considerable distribution discrepancy between training and online serving; and 2) e-commerce promotions have significant impacts on small scenarios, leading to distribution uncertainty of the upcoming time period. In this work, we propose a novel CVR method named MetaCVR from a meta-learning perspective to address the DDF issue. Firstly, a base CVR model, which consists of a Feature Representation Network (FRN) and output layers, is designed and trained sufficiently with samples across months. Then we treat time periods with different data distributions as different occasions and obtain positive and negative prototypes for each occasion using the corresponding samples and the pre-trained FRN. Subsequently, a Distance Metric Network (DMN) is devised to calculate the distance metrics between each sample and all prototypes, which helps mitigate the distribution uncertainty. Finally, we develop an Ensemble Prediction Network (EPN) which incorporates the outputs of the FRN and DMN to make the final CVR prediction. In this stage, we freeze the FRN and train the DMN and EPN with samples from the recent time period, thereby effectively easing the distribution discrepancy. To the best of our knowledge, this is the first study of CVR prediction targeting the DDF issue in small-scale recommendation scenarios. Experimental results on real-world datasets validate the superiority of our MetaCVR, and an online A/B test shows our model achieves impressive gains of 11.92% on PCVR and 8.64% on GMV.
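    The prototype-and-distance machinery can be sketched compactly. The snippet below uses our own illustrative choices (mean embeddings as prototypes, Euclidean distances as the metric) and stands in for the learned DMN; names are not the paper's API.

        import numpy as np

        def occasion_prototypes(features, labels):
            """features: (n, d) embeddings from the frozen pre-trained FRN;
            labels: (n,) binary conversion labels for one occasion."""
            pos = features[labels == 1].mean(axis=0)
            neg = features[labels == 0].mean(axis=0)
            return pos, neg

        def distance_features(x, prototypes):
            """Distance from one sample embedding to every occasion prototype;
            these distances feed the final ensemble predictor."""
            return np.array([np.linalg.norm(x - p)
                             for pair in prototypes for p in pair])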
    Music Enhancement via Image Translation and Vocoding. (arXiv:2204.13289v1 [cs.SD])
    Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality music recordings by combining (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms. We find that this approach to music enhancement outperforms baselines which use classical methods for mel-spectrogram inversion and an end-to-end approach directly mapping noisy waveforms to clean waveforms. Additionally, in evaluating the proposed method with a listening test, we analyze the reliability of common audio enhancement evaluation metrics when used in the music domain.
    Control-Aware Prediction Objectives for Autonomous Driving. (arXiv:2204.13319v1 [cs.LG])
    Autonomous vehicle software is typically structured as a modular pipeline of individual components (e.g., perception, prediction, and planning) to help separate concerns into interpretable sub-tasks. Even when end-to-end training is possible, each module has its own set of objectives used for safety assurance, sample efficiency, regularization, or interpretability. However, intermediate objectives do not always align with overall system performance. For example, optimizing the likelihood of a trajectory prediction module might focus more on easy-to-predict agents than on safety-critical or rare behaviors (e.g., jaywalking). In this paper, we present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner to be differentiable. We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories. Experimentally, we show our objectives improve overall system performance in suburban driving scenarios using the CARLA simulator.
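    At its core the objective is a reweighted likelihood, which the following sketch makes explicit; how the weights are produced (attention between agents, or control variation under trajectory swaps) is the paper's contribution and is treated as a given input here.

        import numpy as np

        def capo_loss(nll_per_agent, control_weights):
            """Control-aware objective: per-agent prediction NLL, reweighted
            by downstream control sensitivity."""
            w = control_weights / control_weights.sum()   # normalize importance
            return float((w * nll_per_agent).sum())

        nll = np.array([0.8, 2.5, 0.3])   # per-agent prediction NLL
        w = np.array([0.1, 5.0, 0.2])     # e.g., large weight for a jaywalker
        print(capo_loss(nll, w))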
    Partitioned Variational Inference: A Framework for Probabilistic Federated Learning. (arXiv:2202.12275v4 [stat.ML] UPDATED)
    The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.
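    The bookkeeping at the heart of such partitioned schemes can be sketched in exponential-family natural parameters: the global posterior is the prior times one approximate likelihood factor per client, and a client refits its factor against the "cavity" formed by all the others. This is an illustrative sketch of the update algebra only; PVI's local free-energy optimization is abstracted into a callback.

        import numpy as np

        def pvi_client_update(eta_global, eta_k, optimize_local):
            """eta_global: natural params of the current q; eta_k: this
            client's approximate likelihood factor. optimize_local(cavity)
            returns the natural params of the client's refined local
            posterior, fitted on its own data with the cavity as prior."""
            cavity = eta_global - eta_k          # remove this client's old factor
            eta_local_post = optimize_local(cavity)
            eta_k_new = eta_local_post - cavity  # new approximate likelihood factor
            eta_global_new = cavity + eta_k_new  # equals eta_local_post
            return eta_global_new, eta_k_new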
    UNBUS: Uncertainty-aware Deep Botnet Detection System in Presence of Perturbed Samples. (arXiv:2204.09502v2 [cs.CR] UPDATED)
    A rising number of botnet families have been successfully detected using deep learning architectures. As the variety of attacks increases, these architectures must become more robust against them, yet they have been proven to be very sensitive to small but well-constructed perturbations in the input. Botnet detection requires extremely low false-positive rates (FPR), which are not commonly attainable in contemporary deep learning. Attackers try to increase the FPRs by crafting poisoned samples. The majority of recent research has focused on the use of model loss functions to build adversarial examples and robust models. In this paper, two LSTM-based classification algorithms for botnet classification with an accuracy higher than 98% are presented. An adversarial attack is then proposed that reduces the accuracy to about 30%. Next, by examining methods for computing uncertainty, a defense method is proposed that raises the accuracy back to about 70%. The uncertainty of the accuracy of the proposed methods is investigated using deep-ensemble and stochastic-weight-averaging quantification methods.
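    As one concrete reading of the uncertainty-based defense, a deep ensemble can abstain on inputs where its members disagree. The sketch below is our illustration with a hypothetical rejection threshold, not the paper's exact rule.

        import numpy as np

        def ensemble_predict(models, x, reject_std=0.2):
            """models: independently trained classifiers, each returning
            P(botnet) for input x. High disagreement triggers abstention."""
            probs = np.array([m(x) for m in models])
            mean, std = probs.mean(), probs.std()
            if std > reject_std:          # high uncertainty: flag for review
                return None, mean, std
            return int(mean > 0.5), mean, std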
    Actor-Critic Scheduling for Path-Aware Air-to-Ground Multipath Multimedia Delivery. (arXiv:2204.13343v1 [cs.NI])
    Reinforcement Learning (RL) has recently found wide applications in network traffic management and control because some of its variants do not require prior knowledge of network models. In this paper, we present a novel scheduler for real-time multimedia delivery in multipath systems based on an Actor-Critic (AC) RL algorithm. We focus on a challenging scenario of real-time video streaming from an Unmanned Aerial Vehicle (UAV) using multiple wireless paths. The scheduler acting as an RL agent learns in real-time the optimal policy for path selection, path rate allocation and redundancy estimation for flow protection. The scheduler, implemented as a module of the GStreamer framework, can be used in real or simulated settings. The simulation results show that our scheduler can target a very low loss rate at the receiver by dynamically adapting in real-time the scheduling policy to the path conditions without performing training or relying on prior knowledge of network channel models.
    ELM: Embedding and Logit Margins for Long-Tail Learning. (arXiv:2204.13208v1 [cs.LG])
    Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially sub-optimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space, and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding, and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings.
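    The logit-space half of this idea can be sketched with a logit-adjusted cross-entropy in the style of Menon et al., where rare classes receive a margin proportional to their log-prior; ELM's embedding-margin term is omitted here.

        import torch
        import torch.nn.functional as F

        def logit_margin_ce(logits, targets, class_counts, tau=1.0):
            """Cross-entropy with class-frequency margins in logit space;
            rare classes are handicapped during training, enlarging their
            effective margin at test time."""
            prior = class_counts.float() / class_counts.sum()
            return F.cross_entropy(logits + tau * prior.log(), targets)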
    Minimizing Client Drift in Federated Learning via Adaptive Bias Estimation. (arXiv:2204.13170v1 [cs.LG])
    In Federated Learning a number of clients collaborate to train a model without sharing their data. Client models are optimized locally and are communicated through a central hub called the server. A major challenge is to deal with heterogeneity among clients' data, which causes the local optimization to drift away from the global objective. In order to estimate and therefore remove this drift, variance reduction techniques have recently been incorporated into Federated Learning optimization. However, the existing solutions propagate their estimation errors throughout the optimization trajectory, which leads to inaccurate approximations of the clients' drift and ultimately a failure to remove it properly. In this paper, we address this issue by introducing an adaptive algorithm that efficiently reduces clients' drift. Compared to previous works on adapting variance reduction to Federated Learning, our approach uses the same or less communication bandwidth, computation, and memory. Additionally, it addresses the instability problem prevalent in prior work, caused by the increasing norm of the estimates, which makes our approach a much more practical solution for large-scale Federated Learning settings. Our experimental results demonstrate that the proposed algorithm converges significantly faster and achieves higher accuracy than the baselines across an extensive set of Federated Learning benchmarks.
    Continual Learning with Bayesian Model based on a Fixed Pre-trained Feature Extractor. (arXiv:2204.13349v1 [cs.LG])
    Deep learning has shown human-level performance in various applications. However, current deep learning models are characterised by catastrophic forgetting of old knowledge when learning new classes. This poses a challenge particularly in intelligent diagnosis systems, where initially only training data for a limited number of diseases are available. In this case, updating the intelligent system with data of new diseases would inevitably downgrade its performance on previously learned diseases. Inspired by the process of learning new knowledge in human brains, we propose a Bayesian generative model for continual learning built on a fixed pre-trained feature extractor. In this model, knowledge of each old class can be compactly represented by a collection of statistical distributions, e.g. with Gaussian mixture models, and naturally kept from forgetting in continual learning over time. Unlike existing class-incremental learning methods, the proposed approach is not sensitive to the continual learning process and can additionally be applied to the data-incremental learning scenario. Experiments on multiple medical and natural image classification tasks showed that the proposed approach outperforms state-of-the-art approaches that even keep some images of old classes during continual learning of new classes.
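    A minimal sketch of this design: each class gets its own Gaussian mixture over frozen features, so learning a new class never modifies old ones. Component counts and uniform class priors below are our assumptions.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        class GMMContinualClassifier:
            def __init__(self, n_components=3):
                self.n_components = n_components
                self.class_models = {}   # label -> fitted GaussianMixture

            def learn_class(self, label, features):
                # Fitting touches only this class's model: no forgetting.
                gm = GaussianMixture(self.n_components, covariance_type="diag")
                self.class_models[label] = gm.fit(features)

            def predict(self, features):
                # Maximum class log-likelihood (uniform class priors assumed).
                labels = sorted(self.class_models)
                ll = np.column_stack(
                    [self.class_models[c].score_samples(features) for c in labels])
                return np.array(labels)[ll.argmax(axis=1)]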
    AutoLossGen: Automatic Loss Function Generation for Recommender Systems. (arXiv:2204.13160v1 [cs.IR])
    In recommendation systems, the choice of loss function is critical since a good loss may significantly improve model performance. However, manually designing a good loss is a big challenge due to the complexity of the problem. A large fraction of previous work focuses on handcrafted loss functions, which require significant expertise and human effort. In this paper, inspired by the recent development of automated machine learning, we propose an automatic loss function generation framework, AutoLossGen, which is able to generate loss functions directly constructed from basic mathematical operators without prior knowledge of loss structure. More specifically, we develop a controller model driven by reinforcement learning to generate loss functions, and develop an iterative and alternating optimization schedule to update the parameters of both the controller model and the recommender model. One challenge for automatic loss generation in recommender systems is the extreme sparsity of recommendation datasets, which leads to the sparse reward problem for loss generation and search. To solve the problem, we further develop a reward filtering mechanism for efficient and effective loss generation. Experimental results show that our framework manages to create tailored loss functions for different recommendation models and datasets, and the generated loss gives better recommendation performance than commonly used baseline losses. Besides, most of the generated losses are transferable, i.e., a loss generated based on one model and dataset also works well for another model or dataset. Source code of the work is available at https://github.com/rutgerswiselab/AutoLossGen.
    SwiftAgg+: Achieving Asymptotically Optimal Communication Loads in Secure Aggregation for Federated Learning. (arXiv:2203.13060v2 [cs.IT] UPDATED)
    We propose SwiftAgg+, a novel secure aggregation protocol for federated learning systems, where a central server aggregates local models of $N \in \mathbb{N}$ distributed users, each of size $L \in \mathbb{N}$, trained on their local data, in a privacy-preserving manner. SwiftAgg+ can significantly reduce the communication overheads without any compromise on security, and achieve optimal communication loads within diminishing gaps. Specifically, in presence of at most $D$ dropout users, SwiftAgg+ achieves a per-user communication load of $(1+\mathcal{O}(\frac{1}{N}))L$ and a server communication load of $(1+\mathcal{O}(\frac{1}{N}))L$, with a worst-case information-theoretic security guarantee, against any subset of up to $T$ semi-honest users who may also collude with the curious server. Moreover, the proposed SwiftAgg+ allows for a flexible trade-off between communication loads and the number of active communication links. In particular, for any $K\in\mathbb{N}$, SwiftAgg+ can achieve the server communication load of $(1+\frac{T}{K})L$, and per-user communication load of up to $(1+\frac{T+D}{K})L$, where the number of pair-wise active connections in the network is $\frac{N}{2}(K+T+D+1)$.
    Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces. (arXiv:2202.10613v3 [stat.ML] UPDATED)
    Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.
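    The pathwise-conditioning idea (Matheron's rule) is short enough to show directly: draw one prior function sample, then convert it into a posterior sample with a deterministic update term, after which it can be evaluated anywhere without further randomness. A small numpy sketch with an RBF kernel:

        import numpy as np

        def rbf(a, b, ls=1.0):
            return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

        rng = np.random.default_rng(0)
        X = np.array([-2.0, 0.0, 1.5]); y = np.array([0.5, -0.3, 1.0]); noise = 0.1
        Xs = np.linspace(-3, 3, 200)

        # One joint prior sample over training and test locations.
        Xall = np.concatenate([X, Xs])
        K = rbf(Xall, Xall) + 1e-9 * np.eye(len(Xall))
        f = np.linalg.cholesky(K) @ rng.standard_normal(len(Xall))
        f_X, f_s = f[:len(X)], f[len(X):]
        eps = noise * rng.standard_normal(len(X))

        # Posterior sample = prior sample + deterministic update (Matheron).
        A = np.linalg.solve(rbf(X, X) + noise**2 * np.eye(len(X)), y - (f_X + eps))
        posterior_sample = f_s + rbf(Xs, X) @ A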
    Model-Based Safe Policy Search from Signal Temporal Logic Specifications Using Recurrent Neural Networks. (arXiv:2103.15938v2 [eess.SY] UPDATED)
    We propose a policy search approach to learn controllers from specifications given as Signal Temporal Logic (STL) formulae. The system model, which is unknown but assumed to be an affine control system, is learned together with the control policy. The model is implemented as two feedforward neural networks (FNNs) - one for the drift, and one for the control directions. To capture the history dependency of STL specifications, we use a recurrent neural network (RNN) to implement the control policy. In contrast to prevalent model-free methods, the learning approach proposed here takes advantage of the learned model and is more efficient. We use control barrier functions (CBFs) with the learned model to improve the safety of the system. We validate our algorithm via simulations and experiments. The results show that our approach can satisfy the given specification within very few system runs, and can be used for on-line control.
    Compositional Federated Learning for Distributionally Robust and Meta Learning. (arXiv:2106.11264v2 [cs.LG] UPDATED)
    In this paper, we propose an effective and efficient Compositional Federated Learning (ComFedL) algorithm for solving a new compositional Federated Learning (FL) framework, which frequently appears in many data mining and machine learning problems with a hierarchical structure, such as distributionally robust FL and model-agnostic meta learning (MAML). Moreover, we provide a convergence analysis of our ComFedL algorithm under some mild conditions, and prove that it achieves a convergence rate of $O(\frac{1}{\sqrt{T}})$, where $T$ denotes the number of iterations. To the best of our knowledge, our new compositional FL framework is the first work to bridge federated learning with compositional stochastic optimization. In particular, we first transform the distributionally robust FL problem (i.e., a minimax optimization problem) into a simple composition optimization problem by using KL divergence regularization. We similarly transform the distribution-agnostic MAML problem (i.e., a minimax optimization problem) into a simple yet effective composition optimization problem. Finally, we apply two popular machine learning tasks, i.e., distributionally robust FL and MAML, to demonstrate the effectiveness of our algorithm.
    ES-ENAS: Blackbox Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution. (arXiv:2101.07415v5 [cs.LG] UPDATED)
    In this paper, we approach the problem of optimizing blackbox functions over large hybrid search spaces consisting of both combinatorial and continuous parameters. We demonstrate that previous evolutionary algorithms which rely on mutation-based approaches, while flexible over combinatorial spaces, suffer from a curse of dimensionality in high-dimensional continuous spaces both theoretically and empirically, which thus limits their scope over hybrid search spaces as well. In order to combat this curse, we propose ES-ENAS, a simple and modular joint optimization procedure combining the class of sample-efficient smoothed-gradient techniques, commonly known as Evolutionary Strategies (ES), with combinatorial optimizers in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). By doing so, we achieve significantly better sample efficiency, which we demonstrate empirically over synthetic benchmarks, and are further able to apply ES-ENAS for architecture search over popular RL benchmarks.
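    The continuous half of the procedure is the standard antithetic ES gradient estimator, which the sketch below illustrates on a toy objective; in ES-ENAS it runs jointly with a combinatorial optimizer proposing architectures, which is omitted here.

        import numpy as np

        def es_gradient(f, theta, sigma=0.1, n_pairs=16, rng=np.random.default_rng()):
            """Antithetic Gaussian-smoothing gradient estimate of blackbox f."""
            grad = np.zeros_like(theta)
            for _ in range(n_pairs):
                eps = rng.standard_normal(theta.shape)
                grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
            return grad / (2 * sigma * n_pairs)

        f = lambda w: -np.sum(w ** 2)     # toy blackbox objective
        theta = np.ones(5)
        for _ in range(200):
            theta += 0.05 * es_gradient(f, theta)
        print(theta)                      # approaches the maximizer at 0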
    Classifier Calibration: with application to threat scores in cybersecurity. (arXiv:2102.05143v3 [cs.LG] UPDATED)
    This paper explores the calibration of a classifier output score in binary classification problems. A calibrator is a function that maps the arbitrary classifier score of a testing observation onto $[0,1]$ to provide an estimate for the posterior probability of belonging to one of the two classes. Calibration is important for two reasons: first, it provides a meaningful score, that is, the posterior probability; second, it puts the scores of different classifiers on the same scale for comparable interpretation. The paper presents three main contributions: (1) Introducing multi-score calibration, when more than one classifier provides a score for a single observation. (2) Introducing the idea that the classifier scores fed to a calibration process are nothing but features to a classifier, hence proposing expanding the classifier scores to higher dimensions to boost the calibrator's performance. (3) Conducting a massive simulation study, on the order of 24,000 experiments, that incorporates different configurations, in addition to experimenting on two real datasets from the cybersecurity domain. The results show that there is no overall winner among the different calibrators and different configurations. However, general advice for practitioners includes the following: Platt's calibrator~\citep{Platt1999ProbabilisticOutputsForSupport}, a version of logistic regression that decreases bias for small sample sizes, has very stable and acceptable performance across all experiments; our suggested multi-score calibration provides better performance than single-score calibration in the majority of experiments, including the two real datasets. In addition, expanding the scores can help in some experiments.
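    Treating scores as features makes multi-score calibration a one-liner on top of logistic regression, as in this hedged sklearn sketch with synthetic scores:

        # Platt-style calibration as logistic regression on classifier scores;
        # the multi-score variant simply treats several classifiers' scores as
        # a feature vector. Data here is synthetic and illustrative.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, 1000)
        scores = np.column_stack([          # two raw scores, arbitrary scales
            y + rng.normal(0, 0.8, 1000),
            2 * y + rng.normal(0, 1.5, 1000),
        ])
        calibrator = LogisticRegression().fit(scores, y)    # multi-score Platt
        posterior = calibrator.predict_proba(scores)[:, 1]  # calibrated P(class 1)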
    Consistent Relative Confidence and Label-Free Model Selection for Convolutional Neural Networks. (arXiv:2108.11845v7 [cs.CV] UPDATED)
    This letter is concerned with image classification with deep convolutional neural networks (CNNs). The focus is on the following question: given a set of candidate CNN models, how to select the right one with the best generalization property for the current task? Present model selection methods require access to a batch of labeled data for computing a pre-specified performance metric, such as the cross-entropy loss, the classification error rate, or the negative log-likelihood. In many practical cases, labels are not available in time, as labeling itself is a time-consuming and expensive task. To this end, this letter presents an approach to CNN model selection using only unlabeled data. This method is developed based on a principle termed consistent relative confidence. The effectiveness and efficiency of the proposed method are demonstrated by experiments using benchmark datasets.
    Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems. (arXiv:2007.03278v3 [cs.LG] UPDATED)
    Emerging cross-device artificial intelligence (AI) applications require a transition from conventional centralized learning systems towards large-scale distributed AI systems that can collaboratively perform complex learning tasks. In this regard, democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems. The outlined principles are meant to study a generalization in distributed learning systems that goes beyond existing mechanisms such as federated learning. Moreover, such learning systems rely on hierarchical self-organization of well-connected distributed learning agents who have limited and highly personalized data and can evolve and regulate themselves based on the underlying duality of specialized and generalized processes. Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper. The approach consists of a self-organizing hierarchical structuring mechanism based on agglomerative clustering, hierarchical generalization, and corresponding learning mechanism. Subsequently, hierarchical generalized learning problems in recursive forms are formulated and shown to be approximately solved using the solutions of distributed personalized learning problems and hierarchical update mechanisms. To that end, a distributed learning algorithm, namely DemLearn is proposed. Extensive experiments on benchmark MNIST, Fashion-MNIST, FE-MNIST, and CIFAR-10 datasets show that the proposed algorithms demonstrate better results in the generalization performance of learning models in agents compared to the conventional FL algorithms. The detailed analysis provides useful observations to further handle both the generalization and specialization performance of the learning models in Dem-AI systems.
    Quality Inference in Federated Learning with Secure Aggregation. (arXiv:2007.06236v3 [cs.LG] UPDATED)
    Federated learning algorithms are developed both for efficiency reasons and to ensure the privacy and confidentiality of personal and business data. Despite no data being shared explicitly, recent studies have shown that the mechanism could still leak sensitive information. Hence, secure aggregation is utilized in many real-world scenarios to prevent attribution to specific participants. In this paper, we focus on the quality of individual training datasets and show that such quality information can be inferred and attributed to specific participants even when secure aggregation is applied. Specifically, through a series of image recognition experiments, we infer the relative quality ordering of participants. Moreover, we apply the inferred quality information to detect misbehaviours, to stabilize training performance, and to measure the individual contributions of participants.
    Computer Vision for Road Imaging and Pothole Detection: A State-of-the-Art Review of Systems and Algorithms. (arXiv:2204.13590v1 [cs.CV])
    Computer vision algorithms have been prevalently utilized for 3-D road imaging and pothole detection for over two decades. Nonetheless, there is a lack of systematic survey articles on state-of-the-art (SoTA) computer vision techniques, especially deep learning models, developed to tackle these problems. This article first introduces the sensing systems employed for 2-D and 3-D road data acquisition, including camera(s), laser scanners, and Microsoft Kinect. Afterward, it thoroughly and comprehensively reviews the SoTA computer vision algorithms, including (1) classical 2-D image processing, (2) 3-D point cloud modeling and segmentation, and (3) machine/deep learning, developed for road pothole detection. This article also discusses the existing challenges and future development trends of computer vision-based road pothole detection approaches: classical 2-D image processing-based and 3-D point cloud modeling and segmentation-based approaches have already become history; convolutional neural networks (CNNs) have demonstrated compelling road pothole detection results and promise to break the bottleneck with future advances in self/un-supervised learning for multi-modal semantic segmentation. We believe that this survey can serve as practical guidance for developing the next-generation road condition assessment systems.
    Russian Texts Detoxification with Levenshtein Editing. (arXiv:2204.13638v1 [cs.CL])
    Text detoxification is a style transfer task of creating neutral versions of toxic texts. In this paper, we use the concept of text editing to build a two-step tagging-based detoxification model using a parallel corpus of Russian texts. With this model, we achieved the best style transfer accuracy among all models in the RUSSE Detox shared task, surpassing larger sequence-to-sequence models.
    Nonbacktracking spectral clustering of nonuniform hypergraphs. (arXiv:2204.13586v1 [cs.SI])
    Spectral methods offer a tractable, global framework for clustering in graphs via eigenvector computations on graph matrices. Hypergraph data, in which entities interact on edges of arbitrary size, poses challenges for matrix representations and therefore for spectral clustering. We study spectral clustering for nonuniform hypergraphs based on the hypergraph nonbacktracking operator. After reviewing the definition of this operator and its basic properties, we prove a theorem of Ihara-Bass type to enable faster computation of eigenpairs. We then propose an alternating algorithm for inference in a hypergraph stochastic blockmodel via linearized belief-propagation, offering proofs that both formalize and extend several previous results. We perform experiments in real and synthetic data that underscore the benefits of hypergraph methods over graph-based ones when interactions of different sizes carry different information about cluster structure. Through an analysis of our algorithm, we pose several conjectures about the limits of spectral methods and detectability in hypergraph stochastic blockmodels writ large.
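    For intuition, the graph (2-uniform) special case of the nonbacktracking operator is easy to construct: it is indexed by directed edges, with a 1 wherever one edge can follow another without reversing. The paper's operator generalizes this to edges of arbitrary size; the sketch below covers only the graph case.

        import numpy as np

        def nonbacktracking(edges):
            """edges: list of undirected pairs (u, v). Returns the operator B
            with B[(u->v),(x->y)] = 1 iff v == x and y != u."""
            directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
            m = len(directed)
            B = np.zeros((m, m))
            for i, (u, v) in enumerate(directed):
                for j, (x, y) in enumerate(directed):
                    if v == x and y != u:
                        B[i, j] = 1.0
            return B, directed

        B, _ = nonbacktracking([(0, 1), (1, 2), (2, 0)])
        print(np.abs(np.linalg.eigvals(B)).max())   # spectral radius on a triangle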
    PhysioGAN: Training High Fidelity Generative Model for Physiological Sensor Readings. (arXiv:2204.13597v1 [eess.SP])
    Generative models such as the variational autoencoder (VAE) and generative adversarial networks (GANs) have proven to be incredibly powerful for the generation of synthetic data that preserves the statistical properties and utility of real-world datasets, especially in the context of images and natural language text. Nevertheless, until now, there has been no successful demonstration of how to apply either method to generate useful physiological sensory data. The state-of-the-art techniques in this context have achieved only limited success. We present PHYSIOGAN, a generative model to produce high-fidelity synthetic physiological sensor data readings. PHYSIOGAN consists of an encoder, a decoder, and a discriminator. We evaluate PHYSIOGAN against the state-of-the-art techniques using two different real-world datasets: an ECG classification dataset and an activity recognition from motion sensors dataset. We compare PHYSIOGAN to the baseline models not only on the accuracy of class-conditional generation but also on the sample diversity and sample novelty of the synthetic datasets. We show that PHYSIOGAN generates samples with higher utility than other generative models: classification models trained only on synthetic data generated by PHYSIOGAN suffer only a 10% and 20% decrease in classification accuracy relative to models trained on the real data. Furthermore, we demonstrate the use of PHYSIOGAN for sensor data imputation, producing plausible results.
    Personalized Federated Learning with Multiple Known Clusters. (arXiv:2204.13619v1 [cs.LG])
    We consider the problem of personalized federated learning when there are known cluster structures within users. An intuitive approach would be to regularize the parameters so that users in the same cluster share similar model weights. The distances between the clusters can then be regularized to reflect the similarity between different clusters of users. We develop an algorithm that allows each cluster to communicate independently and derive the convergence results. We study a hierarchical linear model to theoretically demonstrate that our approach outperforms agents learning independently and agents learning a single shared weight. Finally, we demonstrate the advantages of our approach using both simulated and real-world data.
    Zero-Shot Logit Adjustment. (arXiv:2204.11822v2 [cs.CV] UPDATED)
    Semantic-descriptor-based Generalized Zero-Shot Learning (GZSL) poses challenges in recognizing novel classes in the test phase. The development of generative models enables current GZSL techniques to probe further into the semantic-visual link, culminating in a two-stage form that includes a generator and a classifier. However, existing generation-based methods focus on enhancing the generator's effect while neglecting the improvement of the classifier. In this paper, we first analyze two properties of the generated pseudo unseen samples: bias and homogeneity. Then, we perform variational Bayesian inference to back-derive the evaluation metrics, which reflect the balance of the seen and unseen classes. As a consequence of our derivation, the aforementioned two properties are incorporated into the classifier training as seen-unseen priors via logit adjustment. The resulting Zero-Shot Logit Adjustment further puts semantic-based classifiers into effect in generation-based GZSL. Our experiments demonstrate that the proposed technique achieves state-of-the-art performance when combined with the basic generator, and that it can improve various generative zero-shot learning frameworks. Our codes are available at https://github.com/cdb342/IJCAI-2022-ZLA.
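    The logit-adjustment step itself is a one-line shift of the classifier logits by class log-priors. The sketch below uses a placeholder seen-unseen prior, whereas the paper back-derives its prior from the variational analysis.

        import torch
        import torch.nn.functional as F

        def seen_unseen_adjusted_ce(logits, targets, log_prior):
            # Shift logits by class log-priors before the cross-entropy.
            return F.cross_entropy(logits + log_prior, targets)

        num_seen, num_unseen = 40, 10
        prior = torch.cat([torch.full((num_seen,), 0.8 / num_seen),
                           torch.full((num_unseen,), 0.2 / num_unseen)])
        log_prior = prior.log()   # placeholder seen-unseen prior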
    Continual Learning for Peer-to-Peer Federated Learning: A Study on Automated Brain Metastasis Identification. (arXiv:2204.13591v1 [cs.LG])
    Due to data privacy constraints, data sharing among multiple centers is restricted. Continual learning, as one approach to peer-to-peer federated learning, can promote multicenter collaboration on deep learning algorithm development by sharing intermediate models instead of training data. This work aims to investigate the feasibility of continual learning for multicenter collaboration on an exemplary application of brain metastasis identification using DeepMedic. 920 contrast-enhanced T1 MRI volumes are split to simulate multicenter collaboration scenarios. A continual learning algorithm, synaptic intelligence (SI), is applied to preserve important model weights for training one center after another. In a bilateral collaboration scenario, continual learning with SI achieves a sensitivity of 0.917, and naive continual learning without SI achieves a sensitivity of 0.906, while two models trained solely on internal data without continual learning achieve sensitivities of only 0.853 and 0.831. In a seven-center multilateral collaboration scenario, the models trained on internal datasets (100 volumes per center) without continual learning obtain a mean sensitivity value of 0.725. With single-visit continual learning (i.e., the shared model visits each center only once during training), the sensitivity is improved to 0.788 and 0.849 without SI and with SI, respectively. With iterative continual learning (i.e., the shared model revisits each center multiple times during training), the sensitivity is further improved to 0.914, which is identical to the sensitivity using mixed data for training. Our experiments demonstrate that continual learning can improve brain metastasis identification performance for centers with limited data. This study demonstrates the feasibility of applying continual learning for peer-to-peer federated learning in multicenter collaboration.
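    The SI mechanism that carries knowledge between centers is a quadratic anchor on important weights. A hedged PyTorch-style sketch, assuming the per-parameter importances accumulated along the earlier training path are precomputed:

        import torch

        def si_penalty(params, anchors, omegas, c=0.1):
            """Synaptic-intelligence regularizer: keep weights that mattered
            for previous centers close to their anchored values.
            params/anchors/omegas: lists of matching tensors."""
            loss = 0.0
            for p, a, om in zip(params, anchors, omegas):
                loss = loss + (om * (p - a) ** 2).sum()
            return c * loss

        # total_loss = task_loss + si_penalty(model_params, prev_params, importances)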
    On tuning a mean-field model for semi-supervised classification. (arXiv:2204.13519v1 [cs.LG])
    Semi-supervised learning (SSL) has become an interesting research area due to its capacity for learning in scenarios where both labeled and unlabeled data are available. In this work, we focus on the task of transduction - when the objective is to label all data presented to the learner - with a mean-field approximation to the Potts model. Aiming at this particular task, we study how classification results depend on $\beta$ and find that the optimal phase depends strongly on the amount of labeled data available. In the same study, we also observe that classifications that are more stable under small fluctuations in $\beta$ are related to configurations of high probability, and we propose a tuning approach based on this observation. This method relies on a novel parameter $\gamma$, and we evaluate two different values of this quantity in comparison with classical methods in the field. This evaluation is conducted by changing the amount of labeled data available and the number of nearest neighbors in the similarity graph. Empirical results show that the tuning method is effective and allows NMF to outperform other approaches in datasets with fewer classes. In addition, one of the chosen values for $\gamma$ also leads to results that are more resilient to changes in the number of neighbors, which might be of interest to practitioners in the field of SSL.
    EVI: Multilingual Spoken Dialogue Tasks and Dataset for Knowledge-Based Enrolment, Verification, and Identification. (arXiv:2204.13496v1 [cs.CL])
    Knowledge-based authentication is crucial for task-oriented spoken dialogue systems that offer personalised and privacy-focused services. Such systems should be able to enrol (E), verify (V), and identify (I) new and recurring users based on their personal information, e.g. postcode, name, and date of birth. In this work, we formalise the three authentication tasks and their evaluation protocols, and we present EVI, a challenging spoken multilingual dataset with 5,506 dialogues in English, Polish, and French. Our proposed models set the first competitive benchmarks, explore the challenges of multilingual natural language processing of spoken dialogue, and set directions for future research.
    Learning General Inventory Management Policy for Large Supply Chain Network. (arXiv:2204.13378v1 [cs.AI])
    Inventory management in warehouses directly affects the profits made by manufacturers. In particular, large manufacturers produce a very large variety of products that are handled by a significantly large number of retailers. In such cases, the computational complexity of classical inventory management algorithms is inordinately large. In recent years, learning-based approaches have become popular for addressing such problems. However, previous studies have not addressed systems where both the number of products and the number of retailers are large. This study proposes a reinforcement learning-based warehouse inventory management algorithm that can be used for supply chain systems where both the number of products and retailers are large. To solve the computational problem of handling large systems, we provide a means of approximate simulation of the system in the training phase. Our experiments on both real and artificial data demonstrate that our algorithm with approximated simulation can successfully handle large supply chain networks.
    Mixup-based Deep Metric Learning Approaches for Incomplete Supervision. (arXiv:2204.13572v1 [cs.LG])
    Deep learning architectures have achieved promising results in different areas (e.g., medicine, agriculture, and security). However, using those powerful techniques in many real applications becomes challenging due to the large labeled collections required during training. Several works have pursued solutions to overcome it by proposing strategies that can learn more for less, e.g., weakly and semi-supervised learning approaches. As these approaches do not usually address memorization and sensitivity to adversarial examples, this paper presents three deep metric learning approaches combined with Mixup for incomplete-supervision scenarios. We show that some state-of-the-art approaches in metric learning might not work well in such scenarios. Moreover, the proposed approaches outperform most of them in different datasets.
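    For reference, the Mixup operation the approaches build on is a convex combination of sample pairs and their labels, as in this short PyTorch sketch:

        import torch

        def mixup(x, y_onehot, alpha=0.4):
            """Return convex combinations of a batch with a shuffled copy of
            itself; labels are mixed with the same coefficient."""
            lam = torch.distributions.Beta(alpha, alpha).sample()
            perm = torch.randperm(x.size(0))
            x_mix = lam * x + (1 - lam) * x[perm]
            y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
            return x_mix, y_mix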
    Adversarial Fine-tune with Dynamically Regulated Adversary. (arXiv:2204.13232v1 [cs.LG])
    Adversarial training is an effective method to boost model robustness to malicious, adversarial attacks. However, such improvements in model robustness often lead to a significant sacrifice of standard performance on clean images. In many real-world applications, such as health diagnosis and autonomous surgical robotics, standard performance is valued more than model robustness against such extremely malicious attacks. This leads to the question: to what extent can we boost model robustness without sacrificing standard performance? This work tackles this problem and proposes a simple yet effective transfer learning-based adversarial training strategy that disentangles the negative effects of adversarial samples on the model's standard performance. In addition, we introduce a training-friendly adversarial attack algorithm, which facilitates the boost of adversarial robustness without introducing significant training complexity. Extensive experimentation indicates that the proposed method outperforms previous adversarial training algorithms towards the target: improving model robustness while preserving the model's standard performance on clean data.
    Semantic Communication: An Information Bottleneck View. (arXiv:2204.13366v1 [cs.IT])
    Motivated by recent success of machine learning tools at the PHY layer and driven by high bandwidth demands of the next wireless communication standard 6G, the old idea of semantic communication by Weaver from 1949 has received considerable attention. It breaks with the classic design paradigm according to Shannon by aiming to transmit the meaning of a message rather than its exact copy and thus potentially allows for savings in bandwidth. In this work, inspired by Weaver, we propose an information-theoretic framework where the semantic context is explicitly introduced into probabilistic models. In particular, for bandwidth efficient transmission, we define semantic communication system design as an Information Bottleneck optimization problem and consider important implementation aspects. Further, we uncover the restrictions of the classic 5G communication system design w.r.t. semantic context. Notably, based on the example of distributed image classification, we reveal the huge potential of a semantic communication system design. Numerical results show a tremendous saving in bandwidth of 20 dB with our proposed approach ISCNet compared to a classic PHY layer design.
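    For reference, the classical information bottleneck objective that the design problem builds on (in our notation; the paper's formulation adds implementation constraints): choose an encoder $p(z \mid x)$ that is maximally compressive of the channel input $X$ while retaining information about the semantically relevant variable $Y$,

        \[ \min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Z;Y), \qquad \beta > 0. \]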
    WeaNF: Weak Supervision with Normalizing Flows. (arXiv:2204.13409v1 [cs.CL])
    A popular approach to decreasing the need for costly manual annotation of large data sets is weak supervision, which introduces problems of noisy labels, coverage, and bias. Methods for overcoming these problems have relied either on discriminative models, trained with cost functions specific to weak supervision, or, more recently, on generative models that try to model the output of the automatic annotation process. In this work, we explore a novel direction of generative modeling for weak supervision: instead of modeling the output of the annotation process (the labeling function matches), we generatively model the input-side data distributions (the feature space) covered by labeling functions. Specifically, we estimate a density for each weak labeling source, or labeling function, using normalizing flows. An integral part of our method is the flow-based modeling of multiple simultaneously matching labeling functions, so that phenomena such as labeling function overlap and correlations are captured. We analyze the effectiveness and modeling capabilities on various commonly used weak supervision data sets, and show that weakly supervised normalizing flows compare favorably to standard weak supervision baselines.
    Continual Backprop: Stochastic Gradient Descent with Persistent Randomness. (arXiv:2108.06325v2 [cs.LG] UPDATED)
    The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this degradation in Backprop's plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the \textit{Continual Backprop} algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.
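    A hedged sketch of a generate-and-test step in this spirit: alongside ordinary SGD, a small fraction of low-utility hidden units is periodically re-initialized, restoring the random-weight plasticity that vanilla Backprop loses over time. The utility measure below (mean absolute outgoing weight times mean activation) is our illustrative choice, not the paper's exact one.

        import numpy as np

        def generate_and_test(W_in, W_out, mean_act, replace_frac=0.01,
                              rng=np.random.default_rng()):
            """W_in: (in_dim, n_hidden); W_out: (n_hidden, out_dim);
            mean_act: (n_hidden,) running mean of unit activations."""
            n_hidden = W_in.shape[1]
            utility = np.abs(W_out).mean(axis=1) * np.abs(mean_act)
            k = max(1, int(replace_frac * n_hidden))
            worst = np.argsort(utility)[:k]       # least useful units
            W_in[:, worst] = 0.01 * rng.standard_normal((W_in.shape[0], k))
            W_out[worst, :] = 0.0                 # new unit starts inert
            return W_in, W_out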
    Schr\"odinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training. (arXiv:2204.13666v1 [cs.LG])
    We introduce a software-hardware co-design approach to reduce memory traffic and footprint during training with BFloat16 or FP32, boosting energy efficiency and execution-time performance. We introduce methods to dynamically adjust the size and format of the floating-point containers used to store activations and weights during training. The different value distributions lead us to different approaches for exponents and mantissas. Gecko exploits the favourable exponent distribution with a loss-less delta encoding approach to reduce the total exponent footprint by up to $58\%$ in comparison to a 32-bit floating-point baseline. To contend with the noisy mantissa distributions, we present two lossy methods to eliminate as many least significant bits as possible without affecting accuracy. Quantum Mantissa is a machine-learning-first mantissa compression method that taps into training's gradient descent algorithm to learn minimal mantissa bitlengths on a per-layer granularity, obtaining up to $92\%$ reduction in total mantissa footprint. Alternatively, BitChop observes changes in the loss function during training to adjust the mantissa bit-length network-wide, yielding a reduction of $81\%$ in footprint. Schr\"{o}dinger's FP implements hardware encoders/decoders that, guided by Gecko/Quantum Mantissa or Gecko/BitChop, transparently encode/decode values when transferring to/from off-chip memory, boosting energy efficiency and reducing execution time.
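    A software model of the loss-less exponent idea is straightforward and conveys where the savings come from; the group size and bit accounting below are our assumptions, not the hardware's parameters.

        # Illustrative model of exponent delta encoding: within each group of
        # values, exponents are stored as a shared base plus narrow deltas.
        import numpy as np

        def exponent_delta_bits(values, group=16):
            exps = ((values.view(np.uint32) >> 23) & 0xFF).astype(np.int32)
            total = 0
            for i in range(0, len(exps), group):
                g = exps[i:i + group]
                deltas = g - g.min()                   # base exponent per group
                width = max(int(deltas.max()).bit_length(), 1)
                total += 8 + width * len(g)            # 8-bit base + narrow deltas
            return total   # compare with 8 * len(exps) for the raw FP32 cost

        x = np.random.randn(1024).astype(np.float32)
        print(exponent_delta_bits(x), 8 * x.size)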
    Interpretable Graph Convolutional Network of Multi-Modality Brain Imaging for Alzheimer's Disease Diagnosis. (arXiv:2204.13188v1 [cs.LG])
    Identification of brain regions related to specific neurological disorders is of great importance for biomarker and diagnostic studies. In this paper, we propose an interpretable Graph Convolutional Network (GCN) framework for the identification and classification of Alzheimer's disease (AD) using multi-modality brain imaging data. Specifically, we extended the Gradient Class Activation Mapping (Grad-CAM) technique to quantify the most discriminative features identified by GCN from brain connectivity patterns. We then utilized them to find signature regions of interest (ROIs) by detecting the differences in features between regions in the healthy control (HC), mild cognitive impairment (MCI), and AD groups. We conducted the experiments on the ADNI database with imaging data from three modalities, including VBM-MRI, FDG-PET, and AV45-PET, and showed that the ROI features learned by our method were effective for enhancing the performance of both clinical score prediction and disease status identification. It also successfully identified biomarkers associated with AD and MCI.
    BAGNet: Bidirectional Aware Guidance Network for Malignant Breast lesions Segmentation. (arXiv:2204.13342v1 [eess.IV])
    Breast lesion segmentation is an important step in computer-aided diagnosis systems, and it has attracted much attention. However, accurate segmentation of malignant breast lesions is a challenging task due to the effects of heterogeneous structure and similar intensity distributions. In this paper, a novel bidirectional aware guidance network (BAGNet) is proposed to segment malignant lesions from breast ultrasound images. Specifically, the bidirectional aware guidance network is used to capture the context between global (low-level) and local (high-level) features from the input coarse saliency map. The introduction of the global feature map can reduce the interference of surrounding tissue (background) on the lesion regions. To evaluate the segmentation performance of the network, we compared it with several state-of-the-art medical image segmentation methods on a public breast ultrasound dataset using six commonly used evaluation metrics. Extensive experimental results indicate that our method achieves the most competitive segmentation results on malignant breast ultrasound images.
    Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers. (arXiv:2204.13326v1 [cs.LG])
    Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest.
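    The unifying observation is that each task is just a mask pattern over the trajectory token sequence; the sketch below illustrates this with a few hypothetical mask choices (the token layout and task set are ours, not FlexiBiT's exact specification).

        import numpy as np

        T = 4  # timesteps; token layout per step: state s, action a, return R
        kinds = np.array(["s", "a", "R"] * T)

        def task_mask(task):
            """True marks tokens hidden from the model."""
            if task == "behavior_cloning":     # predict actions from states alone
                return np.isin(kinds, ["a", "R"])
            if task == "return_conditioned":   # returns visible, actions predicted
                return kinds == "a"
            if task == "forward_dynamics":     # predict each state after the first
                return (kinds == "s") & (np.arange(len(kinds)) > 0)
            raise ValueError(task)

        print(np.where(task_mask("behavior_cloning"), "[MASK]", kinds))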
    Offline Visual Representation Learning for Embodied Navigation. (arXiv:2204.13226v1 [cs.CV])
    How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) - and find that the OVRL representations lead to significant across-the-board improvements in the state of the art, on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.  ( 2 min )
    Anomaly Detection by Leveraging Incomplete Anomalous Knowledge with Anomaly-Aware Bidirectional GANs. (arXiv:2204.13335v1 [cs.LG])
    The goal of anomaly detection is to identify anomalous samples from normal ones. In this paper, a small number of anomalies are assumed to be available at the training stage, but they are assumed to be collected only from several anomaly types, leaving the majority of anomaly types not represented in the collected anomaly dataset at all. To effectively leverage this kind of incomplete anomalous knowledge represented by the collected anomalies, we propose to learn a probability distribution that can not only model the normal samples, but also guarantee to assign low density values for the collected anomalies. To this end, an anomaly-aware generative adversarial network (GAN) is developed, which, in addition to modeling the normal samples as most GANs do, is able to explicitly avoid assigning probabilities for collected anomalous samples. Moreover, to facilitate the computation of anomaly detection criteria like reconstruction error, the proposed anomaly-aware GAN is designed to be bidirectional, attaching an encoder for the generator. Extensive experimental results demonstrate that our proposed method is able to effectively make use of the incomplete anomalous information, leading to significant performance gains compared to existing methods.  ( 2 min )
    TransHER: Translating Knowledge Graph Embedding with Hyper-Ellipsoidal Restriction. (arXiv:2204.13221v1 [cs.AI])
    Knowledge graph embedding methods are important for knowledge graph completion (link prediction) due to their robust performance and efficiency on large-magnitude datasets. One state-of-the-art method, PairRE, leverages two separate vectors for relations to model complex relations (i.e., 1-to-N, N-to-1, and N-to-N) in knowledge graphs. However, such a method strictly restricts entities on the hyper-ellipsoid surface and thus limits the optimization of entity distribution, which largely hinders the performance of knowledge graph completion. To address this problem, we propose a novel score function TransHER, which leverages relation-specific translations between head and tail entities restricted on separate hyper-ellipsoids. Specifically, given a triplet, our model first maps entities onto two separate hyper-ellipsoids and then conducts a relation-specific translation on one of them. The relation-specific translation provides TransHER with more direct guidance in optimization and the ability to learn semantic characteristics of entities with complex relations. Experimental results show that TransHER can achieve state-of-the-art performance and generalize to datasets in different domains and scales. All our code will be publicly available.  ( 2 min )
    Open challenges for Machine Learning based Early Decision-Making research. (arXiv:2204.13111v1 [cs.LG])
    More and more applications require early decisions, i.e. taken as soon as possible from partially observed data. However, the later a decision is made, the more its accuracy tends to improve, since the description of the problem at hand is enriched over time. Such a compromise between the earliness and the accuracy of decisions has been particularly studied in the field of Early Time Series Classification. This paper introduces a more general problem, called Machine Learning based Early Decision Making (ML-EDM), which consists in optimizing the decision times of models in a wide range of settings where data is collected over time. After defining the ML-EDM problem, ten challenges are identified and proposed to the scientific community to further research in this area. These challenges open important application perspectives, discussed in this paper.  ( 2 min )
    Covariance-aware Feature Alignment with Pre-computed Source Statistics for Test-time Adaptation. (arXiv:2204.13263v1 [cs.LG])
    The accuracy of deep neural networks is degraded when the distribution of features in the test environment (target domain) differs from that of the training (source) environment. To mitigate the degradation, test-time adaptation (TTA), where a model adapts to the target domain without access to the source dataset, can be used in the test environment. However, the existing TTA methods lack feature distribution alignment between the source and target domains, which unsupervised domain adaptation mainly addresses, because accessing the source dataset is prohibited in the TTA setting. In this paper, we propose a novel TTA method, named Covariance-Aware Feature alignment (CAFe), which explicitly aligns the source and target feature distributions at test time. To perform alignment without accessing the source data, CAFe uses auxiliary feature statistics (mean and covariance) pre-computed on the source domain, which are lightweight and easily prepared. Further, to improve efficiency and stability, we propose feature grouping, which splits the feature dimensions into groups according to their correlations by using spectral clustering to avoid degeneration of the covariance matrix. We empirically show that CAFe outperforms prior TTA methods on a variety of distribution shifts.  ( 2 min )
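    The alignment step itself can be pictured as moment matching against the pre-computed source statistics. Below is a minimal sketch of that idea, assuming a plain squared-moment distance; the paper's actual distance measure and its spectral-clustering feature grouping are not reproduced here, and all names are hypothetical.

        import torch

        def cafe_style_alignment_loss(feats, mu_src, cov_src):
            """Match a test batch's feature moments to pre-computed source stats.

            feats:   (batch, dim) features of the current test batch.
            mu_src:  (dim,) source feature mean, computed once offline.
            cov_src: (dim, dim) source feature covariance, computed once offline.
            """
            mu_t = feats.mean(dim=0)
            centered = feats - mu_t
            cov_t = centered.T @ centered / max(feats.shape[0] - 1, 1)
            # Squared distance between first and second moments.
            return ((mu_t - mu_src) ** 2).sum() + ((cov_t - cov_src) ** 2).sum()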
    Counterfactual Explanations for Natural Language Interfaces. (arXiv:2204.13192v1 [cs.CL])
    A key challenge facing natural language interfaces is enabling users to understand the capabilities of the underlying system. We propose a novel approach for generating explanations of a natural language interface based on semantic parsing. We focus on counterfactual explanations, which are post-hoc explanations that describe to the user how they could have minimally modified their utterance to achieve their desired goal. In particular, the user provides an utterance along with a demonstration of their desired goal; then, our algorithm synthesizes a paraphrase of their utterance that is guaranteed to achieve their goal. In two user studies, we demonstrate that our approach substantially improves user performance, and that it generates explanations that more closely match the user's intent compared to two ablations.  ( 2 min )
    On the Convergence of Momentum-Based Algorithms for Federated Stochastic Bilevel Optimization Problems. (arXiv:2204.13299v1 [cs.LG])
    In this paper, we studied the federated stochastic bilevel optimization problem. In particular, we developed two momentum-based algorithms for optimizing this kind of problem. In addition, we established the convergence rate of these two algorithms, providing their sample and communication complexities. To the best of our knowledge, this is the first work achieving such favorable theoretical results.  ( 2 min )
    Watts: Infrastructure for Open-Ended Learning. (arXiv:2204.13250v1 [cs.AI])
    This paper proposes a framework called Watts for implementing, comparing, and recombining open-ended learning (OEL) algorithms. Motivated by modularity and algorithmic flexibility, Watts atomizes the components of OEL systems to promote the study of and direct comparisons between approaches. Examining implementations of three OEL algorithms, the paper introduces the modules of the framework. The hope is for Watts to enable benchmarking and to explore new types of OEL algorithms. The repo is available at https://github.com/aadharna/watts  ( 2 min )
    R-MBO: A Multi-surrogate Approach for Preference Incorporation in Multi-objective Bayesian Optimisation. (arXiv:2204.13166v1 [stat.ML])
    Many real-world multi-objective optimisation problems rely on computationally expensive function evaluations. Multi-objective Bayesian optimisation (BO) can be used to alleviate the computation time to find an approximated set of Pareto optimal solutions. In many real-world problems, a decision-maker has some preferences on the objective functions. One approach to incorporate the preferences in multi-objective BO is to use a scalarising function and build a single surrogate model (mono-surrogate approach) on it. This approach has two major limitations. Firstly, the fitness landscape of the scalarising function and the objective functions may not be similar. Secondly, the approach assumes that the scalarising function distribution is Gaussian, and thus a closed-form expression of an acquisition function e.g., expected improvement can be used. We overcome these limitations by building independent surrogate models (multi-surrogate approach) on each objective function and show that the distribution of the scalarising function is not Gaussian. We approximate the distribution using the generalised extreme value distribution. We present an a-priori multi-surrogate approach to incorporate the desirable objective function values (or reference point) as the preferences of a decision-maker in multi-objective BO. The results and comparison with the existing mono-surrogate approach on benchmark and real-world optimisation problems show the potential of the proposed approach.  ( 2 min )
  • Open

    Exchangeability-Aware Sum-Product Networks. (arXiv:2110.05165v2 [cs.LG] UPDATED)
    Sum-Product Networks (SPNs) are expressive probabilistic models that provide exact, tractable inference. They achieve this efficiency by making use of local independence. On the other hand, mixtures of exchangeable variable models (MEVMs) are a class of tractable probabilistic models that make use of exchangeability of discrete random variables to render inference tractable. Exchangeability, which arises naturally in relational domains, has not been considered for efficient representation and inference in SPNs yet. The contribution of this paper is a novel probabilistic model which we call Exchangeability-Aware Sum-Product Networks (XSPNs). It contains both SPNs and MEVMs as special cases, and combines the ability of SPNs to efficiently learn deep probabilistic models with the ability of MEVMs to efficiently handle exchangeable random variables. We introduce a structure learning algorithm for XSPNs and empirically show that they can be more accurate than conventional SPNs when the data contains repeated, interchangeable parts.  ( 2 min )
    Performance analysis of greedy algorithms for minimising a Maximum Mean Discrepancy. (arXiv:2101.07564v2 [stat.ML] UPDATED)
    We analyse the performance of several iterative algorithms for the quantisation of a probability measure $\mu$, based on the minimisation of a Maximum Mean Discrepancy (MMD). Our analysis includes kernel herding, greedy MMD minimisation and Sequential Bayesian Quadrature (SBQ). We show that the finite-sample-size approximation error, measured by the MMD, decreases as $1/n$ for SBQ and also for kernel herding and greedy MMD minimisation when using a suitable step-size sequence. The upper bound on the approximation error is slightly better for SBQ, but the other methods are significantly faster, with a computational cost that increases only linearly with the number of points selected. This is illustrated by two numerical examples, with the target measure $\mu$ being uniform (a space-filling design application) and with $\mu$ a Gaussian mixture. They suggest that the bounds derived in the paper are overly pessimistic, in particular for SBQ. The sources of this pessimism are identified but seem difficult to counter.  ( 2 min )
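    As a concrete reference point for the greedy methods analysed here, kernel herding admits a very short implementation: each new point maximises the potential of $\mu$ minus the potential of the points already selected. A sketch over a finite candidate pool with a Gaussian kernel (all parameters hypothetical) follows.

        import numpy as np

        def kernel_herding(candidates, mu_samples, n_points, gamma=1.0):
            """Greedy kernel herding over a finite candidate pool.

            candidates: (C, d) points we may select from.
            mu_samples: (M, d) samples representing the target measure mu.
            Each step maximises the potential of mu minus the potential of
            the points already selected, tracking mu in MMD.
            """
            def k(X, Y):   # Gaussian kernel matrix
                d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
                return np.exp(-gamma * d2)

            mu_pot = k(candidates, mu_samples).mean(axis=1)   # E_{x'~mu} k(x, x')
            sum_k = np.zeros(len(candidates))
            selected = []
            for n in range(n_points):
                scores = mu_pot - sum_k / (n + 1)             # herding criterion
                i = int(np.argmax(scores))
                selected.append(candidates[i])
                sum_k += k(candidates, candidates[i:i + 1])[:, 0]
            return np.array(selected)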
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v5 [cs.LG] UPDATED)
    In bandits with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.  ( 2 min )
    Variational Inference with NoFAS: Normalizing Flow with Adaptive Surrogate for Computationally Expensive Models. (arXiv:2108.12657v2 [cs.LG] UPDATED)
    Fast inference of numerical model parameters from data is an important prerequisite to generate predictive models for a wide range of applications. Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive. New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space, and rely on gradient-based optimization instead of sampling, providing a more efficient approach for Bayesian inference about the model parameters. Moreover, the cost of frequently evaluating an expensive likelihood can be mitigated by replacing the true model with an offline trained surrogate model, such as neural networks. However, this approach might generate significant bias when the surrogate is insufficiently accurate around the posterior modes. To reduce the computational cost without sacrificing inferential accuracy, we propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternatively updates the normalizing flow parameters and surrogate model parameters. We also propose an efficient sample weighting scheme for surrogate model training that preserves global accuracy while effectively capturing high posterior density regions. We demonstrate the inferential and computational superiority of NoFAS against various benchmarks, including cases where the underlying model lacks identifiability. The source code and numerical experiments used for this study are available at https://github.com/cedricwangyu/NoFAS.  ( 2 min )
    Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces. (arXiv:2202.10613v3 [stat.ML] UPDATED)
    Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.  ( 2 min )
    Partitioned Variational Inference: A Framework for Probabilistic Federated Learning. (arXiv:2202.12275v4 [stat.ML] UPDATED)
    The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.  ( 2 min )
    Normalizing flows for atomic solids. (arXiv:2111.08696v2 [physics.comp-ph] UPDATED)
    We present a machine-learning approach, based on normalizing flows, for modelling atomic solids. Our model transforms an analytically tractable base distribution into the target solid without requiring ground-truth samples for training. We report Helmholtz free energy estimates for cubic and hexagonal ice modelled as monatomic water as well as for a truncated and shifted Lennard-Jones system, and find them to be in excellent agreement with literature values and with estimates from established baseline methods. We further investigate structural properties and show that the model samples are nearly indistinguishable from the ones obtained with molecular dynamics. Our results thus demonstrate that normalizing flows can provide high-quality samples and free energy estimates without the need for multi-staging.  ( 2 min )
    Forecasting Brain Activity Based on Models of Spatio-Temporal Brain Dynamics: A Comparison of Graph Neural Network Architectures. (arXiv:2112.04266v2 [q-bio.NC] UPDATED)
    Comprehending the interplay between spatial and temporal characteristics of neural dynamics can contribute to our understanding of information processing in the human brain. Graph neural networks (GNNs) provide a new possibility to interpret graph structured signals like those observed in complex brain networks. In our study we compare different spatio-temporal GNN architectures and study their ability to model neural activity distributions obtained in functional MRI (fMRI) studies. We evaluate the performance of the GNN models on a variety of scenarios in MRI studies and also compare it to a VAR model, which is currently often used for directed functional connectivity analysis. We show that by learning localized functional interactions on the anatomical substrate, GNN based approaches are able to robustly scale to large network studies, even when available data are scarce. By including anatomical connectivity as the physical substrate for information propagation, such GNNs also provide a multi-modal perspective on directed connectivity analysis, offering a novel possibility to investigate the spatio-temporal dynamics in brain networks.  ( 2 min )
    Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control. (arXiv:2110.01052v4 [cs.LG] UPDATED)
    We introduce a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees. Our calibration algorithm works with any underlying model and (unknown) data-generating distribution and does not require model refitting. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision and tabular medical data.  ( 2 min )
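    One concrete instantiation of the risk-control-as-testing idea can be sketched in a few lines: scan candidate thresholds from conservative to aggressive and certify each with a Hoeffding p-value until the first failure. This is a hedged sketch of a fixed-sequence variant for a bounded risk, not the paper's full family of procedures.

        import numpy as np

        def learn_then_test(lambdas, risks, n, alpha=0.1, delta=0.05):
            """Fixed-sequence sketch in the spirit of Learn then Test.

            lambdas: candidate thresholds, ordered from most to least conservative.
            risks:   empirical risk R_hat(lambda) on n held-out points,
                     with the risk assumed bounded in [0, 1].
            Returns lambdas certified to have risk <= alpha w.p. >= 1 - delta.
            """
            valid = []
            for lam, r_hat in zip(lambdas, risks):
                # Hoeffding p-value for H0: R(lambda) > alpha.
                p = np.exp(-2 * n * max(alpha - r_hat, 0.0) ** 2)
                if p <= delta:
                    valid.append(lam)
                else:
                    break   # fixed-sequence testing: stop at the first failure
            return valid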
    Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems. (arXiv:2007.03278v3 [cs.LG] UPDATED)
    Emerging cross-device artificial intelligence (AI) applications require a transition from conventional centralized learning systems towards large-scale distributed AI systems that can collaboratively perform complex learning tasks. In this regard, democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems. The outlined principles are meant to study a generalization in distributed learning systems that goes beyond existing mechanisms such as federated learning. Moreover, such learning systems rely on hierarchical self-organization of well-connected distributed learning agents who have limited and highly personalized data and can evolve and regulate themselves based on the underlying duality of specialized and generalized processes. Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper. The approach consists of a self-organizing hierarchical structuring mechanism based on agglomerative clustering, hierarchical generalization, and a corresponding learning mechanism. Subsequently, hierarchical generalized learning problems in recursive forms are formulated and shown to be approximately solved using the solutions of distributed personalized learning problems and hierarchical update mechanisms. To that end, a distributed learning algorithm, namely DemLearn, is proposed. Extensive experiments on the benchmark MNIST, Fashion-MNIST, FE-MNIST, and CIFAR-10 datasets show that the proposed algorithms achieve better generalization performance of the agents' learning models than conventional FL algorithms. The detailed analysis provides useful observations to further handle both the generalization and specialization performance of the learning models in Dem-AI systems.  ( 2 min )
    Adversarial Meta-Learning of Gamma-Minimax Estimators That Leverage Prior Knowledge. (arXiv:2012.05465v2 [stat.ME] UPDATED)
    Bayes estimators are well known to provide a means to incorporate prior knowledge that can be expressed in terms of a single prior distribution. However, when this knowledge is too vague to express with a single prior, an alternative approach is needed. Gamma-minimax estimators provide such an approach. These estimators minimize the worst-case Bayes risk over a set $\Gamma$ of prior distributions that are compatible with the available knowledge. Traditionally, Gamma-minimaxity is defined for parametric models. In this work, we define Gamma-minimax estimators for general models and propose adversarial meta-learning algorithms to compute them when the set of prior distributions is constrained by generalized moments. Accompanying convergence guarantees are also provided. We also introduce a neural network class that provides a rich, but finite-dimensional, class of estimators from which a Gamma-minimax estimator can be selected. We illustrate our method in two settings, namely entropy estimation and a prediction problem that arises in biodiversity studies.  ( 2 min )
    Signal Recovery with Non-Expansive Generative Network Priors. (arXiv:2204.13599v1 [eess.SP])
    We study compressive sensing with a deep generative network prior. Initial theoretical guarantees for efficient recovery from compressed linear measurements have been developed for signals in the range of a ReLU network with Gaussian weights and logarithmic expansivity: that is when each layer is larger than the previous one by a logarithmic factor. It was later shown that constant expansivity is sufficient for recovery. It has remained open whether the expansivity can be relaxed, allowing for networks with contractive layers, as is often the case for real generators. In this work we answer this question, proving that a signal in the range of a Gaussian generative network can be recovered from a few linear measurements provided that the width of the layers is proportional to the input layer size (up to log factors). This condition allows the generative network to have contractive layers. Our result is based on showing that Gaussian matrices satisfy a matrix concentration inequality, which we term Range Restricted Weight Distribution Condition (R2WDC), and which weakens the Weight Distribution Condition (WDC) upon which previous theoretical guarantees were based. The WDC has also been used to analyze other signal recovery problems with generative network priors. By replacing the WDC with the R2WDC, we are able to extend previous results for signal recovery with expansive generative network priors to non-expansive ones. We discuss these extensions for phase retrieval, denoising, and spiked matrix recovery.  ( 2 min )
    Multiplicative Updates for NMF with $\beta$-Divergences under Disjoint Equality Constraints. (arXiv:2010.16223v2 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) is the problem of approximating an input nonnegative matrix, $V$, as the product of two smaller nonnegative matrices, $W$ and $H$. In this paper, we introduce a general framework to design multiplicative updates (MU) for NMF based on $\beta$-divergences ($\beta$-NMF) with disjoint equality constraints, and with penalty terms in the objective function. By disjoint, we mean that each variable appears in at most one equality constraint. Our MU satisfy the set of constraints after each update of the variables during the optimization process, while guaranteeing that the objective function decreases monotonically. We showcase this framework on three NMF models, and show that it competes favorably with the state of the art: (1) $\beta$-NMF with sum-to-one constraints on the columns of $H$, (2) minimum-volume $\beta$-NMF with sum-to-one constraints on the columns of $W$, and (3) sparse $\beta$-NMF with $\ell_2$-norm constraints on the columns of $W$.  ( 2 min )
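    For orientation, the classic unconstrained MU for $\beta$-NMF (Fevotte and Idier, 2011), which this framework generalises to handle disjoint equality constraints and penalties, looks as follows; this sketch implements only the unconstrained baseline.

        import numpy as np

        def beta_nmf_mu(V, r, beta=1.0, n_iter=200, eps=1e-9):
            """Standard multiplicative updates for beta-divergence NMF.

            Minimises D_beta(V || WH); beta=2 is Frobenius, 1 is KL,
            0 is Itakura-Saito. This is the unconstrained baseline that
            the paper extends with constraints and penalty terms.
            """
            m, n = V.shape
            rng = np.random.default_rng(0)
            W, H = rng.random((m, r)) + eps, rng.random((r, n)) + eps
            for _ in range(n_iter):
                WH = W @ H + eps
                H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1) + eps)
                WH = W @ H + eps
                W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
            return W, H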
    Deep Orientation-Aware Functional Maps: Tackling Symmetry Issues in Shape Matching. (arXiv:2204.13453v1 [cs.CV])
    State-of-the-art fully intrinsic networks for non-rigid shape matching often struggle to disambiguate the symmetries of the shapes, leading to unstable correspondence predictions. Meanwhile, recent advances in the functional map framework allow one to enforce orientation preservation using a functional representation for tangent vector field transfer, through so-called complex functional maps. Using this representation, we propose a new deep learning approach to learn orientation-aware features in a fully unsupervised setting. Our architecture is built on top of DiffusionNet, making it robust to discretization changes. Additionally, we introduce a vector field-based loss, which promotes orientation preservation without using (often unstable) extrinsic descriptors.  ( 2 min )
    Predicting single-cell perturbation responses for unseen drugs. (arXiv:2204.13545v1 [cs.LG])
    Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA-seq HTS is required to enrich single-cell data meaningfully. We introduce a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with a transfer learning scheme and demonstrate how training on existing bulk RNA-seq HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating targeted drug discovery.  ( 2 min )
    ELM: Embedding and Logit Margins for Long-Tail Learning. (arXiv:2204.13208v1 [cs.LG])
    Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially sub-optimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space, and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding, and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings.  ( 2 min )
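    The abstract does not give the ELM objective itself, but the logit-margin half of the idea is well illustrated by the standard logit-adjustment baseline (Menon et al., 2021), sketched below; ELM additionally regularises the embedding distribution, which this sketch omits.

        import torch
        import torch.nn.functional as F

        def logit_margin_loss(logits, targets, class_priors, tau=1.0):
            """Logit-space margin for long-tail learning (logit adjustment).

            class_priors: (classes,) tensor of empirical class frequencies.
            Adding tau * log(prior) to the logits means rare classes
            effectively receive a larger margin at training time.
            """
            adjusted = logits + tau * torch.log(class_priors)   # (batch, classes)
            return F.cross_entropy(adjusted, targets)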
    Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models: Extension. (arXiv:1905.10395v5 [cs.LG] UPDATED)
    We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines. This work is a gentle extension of [2].  ( 3 min )
    Overcoming Catastrophic Forgetting via Direction-Constrained Optimization. (arXiv:2011.12581v2 [cs.LG] UPDATED)
    This paper studies a new design of the optimization algorithm for training deep learning models with a fixed architecture of the classification network in a continual learning framework. The training data is non-stationary and the non-stationarity is imposed by a sequence of distinct tasks. We first analyze a deep model trained on only one learning task in isolation and identify a region in network parameter space, where the model performance is close to the recovered optimum. We provide empirical evidence that this region resembles a cone that expands along the convergence direction. We study the principal directions of the trajectory of the optimizer after convergence and show that traveling along a few top principal directions can quickly bring the parameters outside the cone but this is not the case for the remaining directions. We argue that catastrophic forgetting in a continual learning setting can be alleviated when the parameters are constrained to stay within the intersection of the plausible cones of individual tasks that were so far encountered during training. Based on this observation we present our direction-constrained optimization (DCO) method, where for each task we introduce a linear autoencoder to approximate its corresponding top forbidden principal directions. They are then incorporated into the loss function in the form of a regularization term for the purpose of learning the coming tasks without forgetting. Furthermore, in order to control the memory growth as the number of tasks increases, we propose a memory-efficient version of our algorithm called compressed DCO (DCO-COMP) that allocates a memory of fixed size for storing all autoencoders. We empirically demonstrate that our algorithm performs favorably compared to other state-of-art regularization-based continual learning methods.  ( 2 min )
    Unlocking High-Accuracy Differentially Private Image Classification through Scale. (arXiv:2204.13650v1 [cs.LG])
    Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method, realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA on CIFAR-10 of 81.4% under (8, 10^{-5})-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained 200-layer Normalizer-Free ResNet, we achieve a remarkable 77.1% top-1 accuracy on ImageNet under (1, 8*10^{-7})-DP, and achieve 81.1% under (8, 8*10^{-7})-DP. This markedly exceeds the previous SOTA of 47.9% under a larger privacy budget of (10, 10^{-6})-DP. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.  ( 2 min )
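    The DP-SGD mechanism referred to throughout is itself simple to sketch: clip each example's gradient and add calibrated Gaussian noise to the sum. A naive microbatched PyTorch version follows; the paper's signal-propagation and tuning techniques are not shown, and all hyper-parameters are placeholders.

        import torch

        def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip=1.0, sigma=1.0):
            """One microbatched DP-SGD step: per-example clipping + noise.

            Naive per-example loop for clarity; production code vectorises
            per-example gradients (e.g. with Opacus).
            """
            params = [p for p in model.parameters() if p.requires_grad]
            accum = [torch.zeros_like(p) for p in params]
            xs, ys = batch
            for x, y in zip(xs, ys):
                model.zero_grad()
                loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
                # Clip this example's gradient to L2 norm <= clip.
                norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
                scale = torch.clamp(clip / (norm + 1e-8), max=1.0)
                for a, p in zip(accum, params):
                    a.add_(p.grad * scale)
            with torch.no_grad():
                for a, p in zip(accum, params):
                    noise = torch.randn_like(a) * sigma * clip   # std = sigma * C
                    p.add_(-(lr / len(xs)) * (a + noise))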
    AlphaZero-Inspired General Board Game Learning and Playing. (arXiv:2204.13307v1 [cs.LG])
    Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning. While the achievements of AlphaGo and AlphaZero - playing Go and other complex games at super human level - are truly impressive, these architectures have the drawback that they are very complex and require high computational resources. Many researchers are looking for methods that are similar to AlphaZero, but have lower computational demands and are thus more easily reproducible. In this paper, we pick an important element of AlphaZero - the Monte Carlo Tree Search (MCTS) planning stage - and combine it with reinforcement learning (RL) agents. We wrap MCTS, for the first time, around RL n-tuple networks to create versatile agents that at the same time keep computational demands low. We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper. In particular, we show that this AlphaZero-inspired agent is the first one trained on standard hardware (no GPU or TPU) to beat the very strong Othello program Edax up to and including level 7 (where most other algorithms could only defeat Edax up to level 2).  ( 2 min )
    Standardized Evaluation of Machine Learning Methods for Evolving Data Streams. (arXiv:2204.13625v1 [cs.LG])
    Due to the unspecified and dynamic nature of data streams, online machine learning requires powerful and flexible solutions. However, evaluating online machine learning methods under realistic conditions is difficult. Existing work therefore often draws on different heuristics and simulations that do not necessarily produce meaningful and reliable results. Indeed, in the absence of common evaluation standards, it often remains unclear how online learning methods will perform in practice or in comparison to similar work. In this paper, we propose a comprehensive set of properties for high-quality machine learning in evolving data streams. In particular, we discuss sensible performance measures and evaluation strategies for online predictive modelling, online feature selection and concept drift detection. As one of the first works, we also look at the interpretability of online learning methods. The proposed evaluation standards are provided in a new Python framework called float. Float is completely modular and allows the simultaneous integration of common libraries, such as scikit-multiflow or river, with custom code. Float is open-sourced and can be accessed at https://github.com/haugjo/float. In this sense, we hope that our work will contribute to more standardized, reliable and realistic testing and comparison of online machine learning methods.  ( 2 min )
    Quality Inference in Federated Learning with Secure Aggregation. (arXiv:2007.06236v3 [cs.LG] UPDATED)
    Federated learning algorithms are developed both for efficiency reasons and to ensure the privacy and confidentiality of personal and business data, respectively. Despite no data being shared explicitly, recent studies showed that the mechanism could still leak sensitive information. Hence, secure aggregation is utilized in many real-world scenarios to prevent attribution to specific participants. In this paper, we focus on the quality of individual training datasets and show that such quality information could be inferred and attributed to specific participants even when secure aggregation is applied. Specifically, through a series of image recognition experiments, we infer the relative quality ordering of participants. Moreover, we apply the inferred quality information to detect misbehaviours, to stabilize training performance, and to measure the individual contributions of participants.  ( 2 min )
    FedShuffle: Recipes for Better Use of Local Work in Federated Learning. (arXiv:2204.13169v1 [cs.LG])
    The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). In this work, we propose a general recipe, FedShuffle, that better utilizes the local updates in FL, especially in the heterogeneous regime. Unlike many prior works, FedShuffle does not assume any uniformity in the number of updates per device. Our FedShuffle recipe comprises four simple-yet-powerful ingredients: 1) local shuffling of the data, 2) adjustment of the local learning rates, 3) update weighting, and 4) momentum variance reduction (Cutkosky and Orabona, 2019). We present a comprehensive theoretical analysis of FedShuffle and show that both theoretically and empirically, our approach does not suffer from the objective function mismatch that is present in FL methods which assume homogeneous updates in heterogeneous FL setups, e.g., FedAvg (McMahan et al., 2017). In addition, by combining the ingredients above, FedShuffle improves upon FedNova (Wang et al., 2020), which was previously proposed to solve this mismatch. We also show that FedShuffle with momentum variance reduction can improve upon non-local methods under a Hessian similarity assumption. Finally, through experiments on synthetic and real-world datasets, we illustrate how each of the four ingredients used in FedShuffle helps improve the use of local updates in FL.  ( 2 min )
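    To make the "update weighting" ingredient concrete, here is a hedged sketch of step-count-aware aggregation in the FedNova style that FedShuffle improves upon; the exact FedShuffle weights and learning-rate adjustments differ and are given in the paper.

        import numpy as np

        def aggregate(global_w, client_deltas, client_steps, client_sizes):
            """Step-count-aware aggregation of heterogeneous client updates.

            client_deltas: list of (local_w - global_w) arrays.
            client_steps:  number of local SGD steps each client actually took.
            client_sizes:  number of examples per client.
            Normalising each delta by its step count removes the bias toward
            clients that happened to run more local updates.
            """
            p = np.array(client_sizes, dtype=float)
            p /= p.sum()
            normalized = [d / s for d, s in zip(client_deltas, client_steps)]
            avg_steps = float(np.dot(p, client_steps))
            delta = avg_steps * sum(w * d for w, d in zip(p, normalized))
            return global_w + delta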
    A Locally Adaptive Interpretable Regression. (arXiv:2005.03350v4 [stat.ML] UPDATED)
    Machine learning models with both good predictability and high interpretability are crucial for decision support systems. Linear regression is one of the most interpretable prediction models. However, the linearity of a simple linear regression worsens its predictability. In this work, we introduce a locally adaptive interpretable regression (LoAIR). In LoAIR, a metamodel parameterized by neural networks predicts the percentile of a Gaussian distribution for the regression coefficients, enabling rapid adaptation. Our experimental results on public benchmark datasets show that our model not only achieves comparable or better predictive performance than the other state-of-the-art baselines but also discovers some interesting relationships between input and target variables, such as a parabolic relationship between CO2 emissions and Gross National Product (GNP). Therefore, LoAIR is a step towards bridging the gap between econometrics, statistics, and machine learning by improving the predictive ability of linear regression without depreciating its interpretability.  ( 2 min )
    Semantic Communication: An Information Bottleneck View. (arXiv:2204.13366v1 [cs.IT])
    Motivated by recent success of machine learning tools at the PHY layer and driven by high bandwidth demands of the next wireless communication standard 6G, the old idea of semantic communication by Weaver from 1949 has received considerable attention. It breaks with the classic design paradigm according to Shannon by aiming to transmit the meaning of a message rather than its exact copy and thus potentially allows for savings in bandwidth. In this work, inspired by Weaver, we propose an information-theoretic framework where the semantic context is explicitly introduced into probabilistic models. In particular, for bandwidth efficient transmission, we define semantic communication system design as an Information Bottleneck optimization problem and consider important implementation aspects. Further, we uncover the restrictions of the classic 5G communication system design w.r.t. semantic context. Notably, based on the example of distributed image classification, we reveal the huge potential of a semantic communication system design. Numerical results show a tremendous saving in bandwidth of 20 dB with our proposed approach ISCNet compared to a classic PHY layer design.  ( 2 min )
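    For reference, the classic Information Bottleneck objective that this framework builds a variant of is $\min_{p(z \mid x)} \; I(X;Z) - \beta \, I(Z;Y)$: compress the source $X$ into a representation $Z$ while preserving what $Z$ carries about the task-relevant variable $Y$, with $\beta$ trading compression (here, bandwidth) against semantic fidelity.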
    On the Use of Dimension Reduction or Signal Separation Methods for Nitrogen River Pollution Source Identification. (arXiv:2204.13182v1 [stat.AP])
    Identification of the current and expected future pollution sources to rivers is crucial for sound environmental management. For this purpose, numerous approaches have been proposed, which can be clustered under physically based models, stable isotope analysis and mixing methods, mass balance methods, time series analysis, land cover analysis, and spatial statistics. Another extremely common method is Principal Component Analysis, along with modifications such as Absolute Principal Component Score; these have been applied to source identification problems for nitrogen entering rivers. This manuscript examines whether PCA, given its theoretical background and assumptions, can really be a powerful method for uncovering nitrogen pollution sources. Two closely related techniques, Independent Component Analysis and Factor Analysis, are also considered.  ( 2 min )
    On the Normalizing Constant of the Continuous Categorical Distribution. (arXiv:2204.13290v1 [stat.ML])
    Probability distributions supported on the simplex enjoy a wide range of applications across statistics and machine learning. Recently, a novel family of such distributions has been discovered: the continuous categorical. This family enjoys remarkable mathematical simplicity; its density function resembles that of the Dirichlet distribution, but with a normalizing constant that can be written in closed form using elementary functions only. In spite of this mathematical simplicity, our understanding of the normalizing constant remains far from complete. In this work, we characterize the numerical behavior of the normalizing constant and we present theoretical and methodological advances that can, in turn, help to enable broader applications of the continuous categorical distribution. Our code is available at https://github.com/cunningham-lab/cb_and_cc/.  ( 2 min )
    Asymptotic Inference for Infinitely Imbalanced Logistic Regression. (arXiv:2204.13231v1 [math.ST])
    In this paper we extend the work of Owen (2007) by deriving a second order expansion for the slope parameter in logistic regression, when the size of the majority class is unbounded and the minority class is finite. More precisely, we demonstrate that the second order term converges to a normal distribution and explicitly compute its variance, which surprisingly once again depends only on the mean of the minority class points and not their arrangement, under mild regularity assumptions. In the case that the majority class is normally distributed, we illustrate that the variance of the limiting slope depends exponentially on the z-score of the average of the minority class's points with respect to the majority class's distribution. We confirm our results by Monte Carlo simulations.  ( 2 min )
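    The limiting behaviour is easy to probe numerically: fix a small minority class and let the majority class grow. A toy sketch in the spirit of the paper's Monte Carlo confirmation (not their exact setup; all constants are placeholders):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        minority = rng.normal(loc=3.0, scale=0.5, size=20)    # fixed, finite class
        for n_major in [10_000, 100_000, 1_000_000]:
            majority = rng.normal(size=n_major)               # unbounded class
            X = np.concatenate([majority, minority]).reshape(-1, 1)
            y = np.concatenate([np.zeros(n_major), np.ones(len(minority))])
            # Large C approximates unregularised maximum likelihood.
            slope = LogisticRegression(C=1e6, max_iter=1000).fit(X, y).coef_[0, 0]
            print(n_major, slope)   # the slope should stabilise as n_major grows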

  • Open

    Should I become an Art therapist
    submitted by /u/BalanceSubstantial66 [link] [comments]
    Is it easier to mimic a model based on its input/output or to train an original model in the first place?
    An original model is trained with a data-set, typically labeled by humans (say for classifiers). However, what if one would like to copycat a closed model only exposed through an API? In that case, the data-set would instead be the input/output pairs of the original model. Is it easier to train the copycat model or the original one? How much data would be required to train the copycat versus the original one? Are there practical examples of this happening? Can it possibly be worth it and, if so, under which circumstances? How much data would be required, for example, for the notorious Dall-e 2? submitted by /u/Wishmaster04 [link] [comments]  ( 1 min )
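    For context, the copycat setting described here is usually called model extraction or distillation: the student is trained on the teacher's output distribution rather than human labels. A hedged PyTorch sketch, where the API wrapper and all names are hypothetical:

        import torch
        import torch.nn.functional as F

        def distill_step(student, optimizer, x, query_teacher):
            """Train a copycat ("student") on a black-box teacher's outputs.

            query_teacher: callable wrapping the original model's API; returns
            class probabilities for a batch x -- the only access assumed here.
            Soft labels carry more signal per query than hard labels, one
            reason copying can need less data than training from scratch.
            """
            with torch.no_grad():
                teacher_probs = query_teacher(x)               # (batch, classes)
            log_p = F.log_softmax(student(x), dim=1)
            loss = F.kl_div(log_p, teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()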
    Disney princesses according to AI. Is this done manually or through an AI app?
    submitted by /u/p0goniphaft111 [link] [comments]
    Specification gaming: the flip side of AI ingenuity
    submitted by /u/estasfuera [link] [comments]
    Beyond interpretability: developing a language to shape our relationships with AI
    submitted by /u/estasfuera [link] [comments]
    Guide to Iteratively Tuning GNN's
    submitted by /u/aidev2040 [link] [comments]
    Last Week in AI: AI Driving Instructors, MASSIVE Speech Dataset, AI ’Show Stealers’, Tesla Jet Crash
    submitted by /u/regalalgorithm [link] [comments]
    Best way to format dialogue to fine tune GPT-J, 3 ...
    Hi all, a quick question, given that I'm not finding much info about this online: how would you format a dialogue between people to fine-tune a GPT model (be it GPT-J, GPT-3, etc.)? For example, if I want the GPT model to create a new dialogue from the show "Friends" (chose this because all dialogues are available online https://www.kaggle.com/datasets/blessondensil294/friends-tv-series-screenplay-script?resource=download ), how should I format the input? [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.] Monica: There's nothing to tell! He's just some guy I work with! Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him! Chandler: All right Joey, be nice. So does he have a hump? A hump and a hairpiece? Phoebe: Wait, does he eat chalk? (They all stare, bemused.) Phoebe: Just, 'cause, I don't want her to go through what I went through with Carl- oh! Monica: Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex. Chandler: Sounds like a date to me. submitted by /u/Sgnarf1989 [link] [comments]  ( 1 min )
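    One common (and here hypothetical) answer is to turn each turn into a prompt/completion record with speaker tags and an explicit stop marker, in the JSONL format used for GPT-3-style fine-tuning:

        import json

        # Hypothetical scheme: one JSONL record per turn, with the preceding
        # turns as the prompt, speaker tags preserved, and an explicit stop
        # marker ("###") so generation knows where a turn ends.
        dialogue = [
            ("Monica", "There's nothing to tell! He's just some guy I work with!"),
            ("Joey", "C'mon, you're going out with the guy!"),
        ]
        with open("friends.jsonl", "w") as f:
            for i in range(1, len(dialogue)):
                prompt = "\n".join(f"{s}: {t}" for s, t in dialogue[:i])
                speaker, text = dialogue[i]
                record = {"prompt": f"{prompt}\n{speaker}:",
                          "completion": f" {text}\n###"}
                f.write(json.dumps(record) + "\n")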
    Meta's AI team searches for the secret trick of human intelligence
    submitted by /u/much_successes [link] [comments]
    I am new to Artificial Intelligence and I need your worthy suggestions about AI.
    Hey guys, I am new to AI. Please suggest some good and new topics/technologies that are supported by AI which will be very informative and beneficial for me in my career. Every suggestion is a high priority for me. Waiting for your replies. Thanks submitted by /u/adilonreddit1 [link] [comments]  ( 1 min )
    AI Augmentation in this Assemblage 23 music video
    submitted by /u/LightOfAntara [link] [comments]
    How Human Pose Estimation Technology Can Be Used in 2022?
    Hi everyone, I want to share with you an article that I worked on with my colleague about the use cases of human pose estimation. I would love it if you could check it out and share your ideas for using this technology in the comments below. https://mobidev.biz/blog/human-pose-estimation-technology-guide submitted by /u/Data-Power [link] [comments]
    Deepmind Researchers Propose Fair Normalizing Flows (FNF): A Rigorous Approach For Learning Fair Representations
    Fair representation learning has emerged as one of the most promising techniques to encode data into new, impartial representations with high utility as machine learning is increasingly utilized in settings that potentially harm humans. Fair representation means presenting data without regard to gender, color, or other factors. Due to human-introduced bias, these biases are found in the word vector representations in language models. The goal of learning fair representations is to reduce bias by decreasing the semantic distance between biased terms. Fair representation learning aims to guarantee that representations are useful for a variety of prediction tasks and that sensitive aspects of the original data cannot be extracted from them. Adversarial training, which combines an encoder aiming to turn data into a fair representation with an adversary attempting to recover sensitive features from the representation, is the most used method for learning fair representations. However, recent research has discovered that these methods do not provide completely fair representations: stronger opponents can recover sensitive features. This could allow malicious or uneducated users to discriminate using the available representations. The issue of fair representation has lately risen in prominence as regulators draft guidelines on the ethical use of AI, indicating that any company that cannot ensure non-discrimination would be held liable for the data produced. Paper: https://openreview.net/pdf?id=BrFIKuxrZE Github: https://github.com/eth-sri/fnf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    It’s not a place (made with starryai)
    submitted by /u/Losthel [link] [comments]
    Night Cafe "fire on the nuclear plant in the amazon by Simon Stalenhag"
    submitted by /u/brunovianna [link] [comments]
    Ladybugs (A.I. animation + sound design)
    submitted by /u/nenomancer [link] [comments]  ( 1 min )
    Adaptive Multi-Strategy Market-Making Agent For Volatile Markets
    submitted by /u/akolonin [link] [comments]  ( 1 min )
    Artificial Nightmares: Hall Monitor || Clip Guided Diffusion AI Art Video [4K 20 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    Playing Chess With Offline Reinforcement Learning
    submitted by /u/Bellerb [link] [comments]  ( 1 min )
    How to use RL for combinatorial optimization
    I am trying to use reinforcement learning for combinatorial optimization. I coded up a toy example for the travelling salesman problem, but I am confused by something. In stable baselines 3 there is a variable "done". If I only set it when the agent reaches the optimum value (that is, the length of the shortest route through all the cities, computed by a different method), the RL algorithm never finds it. What should done be set to? Or, more generally, how do you do optimization of this sort using RL? I can just set it so that an episode is done after 1000 steps, but then the agent spends a lot of time finding the best way to take the first 1000 steps, which isn't quite the point. submitted by /u/wiggyhat [link] [comments]  ( 2 min )
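    For reference, the conventional choice is to make done mark the end of an episode (e.g. all cities visited), not the attainment of a known optimum, and to let the negative tour length be the return the agent maximises. A minimal old-gym-API sketch of such an environment (details hypothetical):

        import numpy as np
        import gym
        from gym import spaces

        class TSPEnv(gym.Env):
            """Toy TSP episode: done only when every city has been visited.

            'done' marks the end of an episode, not reaching the known
            optimum; the return (negative total tour length) is what the
            agent maximises, so the optimal tour is the best-return policy.
            """
            def __init__(self, coords):
                self.coords = np.asarray(coords, dtype=np.float32)
                n = len(self.coords)
                self.action_space = spaces.Discrete(n)           # next city
                self.observation_space = spaces.MultiBinary(n)   # visited mask
                # (a fuller observation would also encode the current city)

            def reset(self):
                self.visited = np.zeros(len(self.coords), dtype=np.int8)
                self.current = 0
                self.visited[0] = 1
                return self.visited.copy()

            def step(self, action):
                if self.visited[action]:                 # revisiting: penalise
                    return self.visited.copy(), -1.0, False, {}
                d = float(np.linalg.norm(self.coords[self.current]
                                         - self.coords[action]))
                self.visited[action] = 1
                self.current = action
                done = bool(self.visited.all())          # episode ends here
                return self.visited.copy(), -d, done, {}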
    Can agents act simultaneously with no notion of turn-taking?
    I was reading this paper and in section 3 they claim that agents act simultaneously and there is no notion of turn-taking: https://arxiv.org/abs/2104.07750 I was wondering how this works. What I'm used to seeing is a for loop in which, one after the other, agents execute the step function and interact with the environment. How does this change if all agents act simultaneously? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
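    In the simultaneous setting, the step function typically takes a joint action: every agent picks its action from the same state snapshot, then the environment advances once. A minimal sketch (env and policies are hypothetical):

        # Simultaneous stepping: collect one action per agent, then advance the
        # environment once with the joint action (no turn-taking).
        observations = env.reset()  # assumed: dict mapping agent id -> observation
        done = False
        while not done:
            # all agents choose actions from the *same* state snapshot
            actions = {agent: policies[agent](obs) for agent, obs in observations.items()}
            observations, rewards, dones, infos = env.step(actions)  # one joint transition
            done = all(dones.values())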
    How to mitigate catastrophic forgetting for an intelligent agent using replay memory and Deep Reinforcement Learning algorithm
    Hello everyone. I built a use case to instruct an agent, which simulates a herd of predators, to encircle a prey. Episodes are non-terminal and have a fixed number of steps. The agent uses a neural network to decide which actions to take at each step and is trained by a remember/replay mechanism using a memory of past events. I started the experiment using absolute coordinates in the state returned by the environment, and the result is displayed in the following image, where the total reward at the end of each episode is plotted. https://preview.redd.it/oscet38c0gw81.png?width=640&format=png&auto=webp&s=092fc92ee56c919bdc1851bb9f1ef9bdc585bfde In this case the results were encouraging. However, I wanted to create a more generic use case, using vectors that represent the position of the prey with respect to the predators instead of absolute coordinates. Unfortunately, the best result I was able to obtain is the one shown in the image below where, apparently, performance deteriorates after a certain number of episodes. https://preview.redd.it/yoiygwxe0gw81.png?width=640&format=png&auto=webp&s=06b7082cc0ebb80e3f5e5ce8307d75cd6451f79b It seems to me a case of catastrophic forgetting (I could be wrong). I managed to mitigate it by increasing the memory size, retaining the results of older episodes, lowering the learning rate of the neural network and using a learning-rate decay schedule, but I have not succeeded in eliminating the phenomenon completely. Does anyone have any advice? submitted by /u/kaeldric__ [link] [comments]  ( 2 min )
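    One way to retain older episodes, as the poster describes, is to make the replay memory reservoir-style so old transitions are never systematically evicted; a sketch (not the poster's implementation):

        import random

        class ReservoirReplayBuffer:
            """Keeps a uniform random sample over *all* transitions seen so far,
            so old episodes are never systematically evicted (helps against
            forgetting compared to a strictly FIFO buffer)."""
            def __init__(self, capacity):
                self.capacity = capacity
                self.buffer = []
                self.seen = 0  # total transitions ever offered

            def push(self, transition):
                self.seen += 1
                if len(self.buffer) < self.capacity:
                    self.buffer.append(transition)
                else:
                    # reservoir sampling: keep new item with prob capacity/seen
                    j = random.randrange(self.seen)
                    if j < self.capacity:
                        self.buffer[j] = transition

            def sample(self, batch_size):
                return random.sample(self.buffer, batch_size)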
    Speeding up custom implementations.
    Hi all! I've been implementing some of the model-free algorithms recently. Compared to open-source libraries, however, my implementations learn substantially more slowly in wall-clock terms. I wonder what tricks libraries such as stable-baselines3 use to increase frames per second and accelerate policy updates. So far, I have vectorized the environment sampling, and the bottleneck now seems to be computing the updates for the agent. Thanks a lot! submitted by /u/Internal-Brush4929 [link] [comments]  ( 1 min )
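    A big share of the speedup in such libraries typically comes from batching: all transitions gathered from the vectorized environments go through one tensor operation rather than a per-environment Python loop. A toy sketch (policy, optimizer, and data names are assumptions):

        import torch

        # One batched forward/backward pass over every parallel-env transition,
        # instead of looping over environments in Python.
        def batched_update(policy, optimizer, obs, actions, returns):
            # obs: (num_envs * steps, obs_dim); actions/returns: (num_envs * steps,)
            logits = policy(obs)                          # single batched forward pass
            log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
            loss = -(log_probs * returns).mean()          # simple policy-gradient loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()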
    Any blog or video series available on running the TD3 algorithm for path planning, with trained agents deployed at the hardware level?
    submitted by /u/ajithvallabai [link] [comments]  ( 1 min )
    Microsoft AI Researchers Introduce PPE: A Mathematically Guaranteed Reinforcement learning (RL) Algorithm For Exogenous Noise
    Reinforcement learning (RL) is a machine learning training strategy that rewards desirable behaviors while penalizing undesirable ones. A reinforcement learning agent can perceive and comprehend its surroundings, act, and learn through trial and error. Although RL agents can heuristically solve some problems, such as assisting a robot in navigating to a specific location in a given environment, there is no guarantee that they will be able to handle problems in settings they have not yet encountered. Critical to their success is the capacity of these models to recognize the robot and any obstacles in its path while ignoring changes in the surrounding environment that occur independently of the agent, which we refer to as exogenous noise. Existing RL algorithms are not powerful enoug…  ( 2 min )
  • Open

    [D] Usage Optimized AWS GPUs, 57-63% off On-Demand Prices
    www.usage.ai Usage AI bundles 3-year no-upfront Reserved Instances (RIs) on AWS with a guaranteed buyback, so users get all the savings of 3-year RIs with none of the commitment. I helped engineer the product. Here to answer any questions! submitted by /u/usage-team [link] [comments]
    [P] Blog post: Learning JAX by Learning to Learn
    I recently published a new blog post, which goes over how meta-learned optimizers work and how to implement them in JAX. JAX's composable function transforms make implementing meta-learning algorithms very straightforward. If you're interested in JAX or meta learning give it a read! Blog post: https://teddykoker.com/2022/04/learning-to-learn-jax/ Code: https://github.com/teddykoker/learning-to-learn-jax Original Paper (NeurIPS '16): https://arxiv.org/abs/1606.04474 submitted by /u/tomkoker [link] [comments]  ( 1 min )
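    The core trick the post describes, differentiating through an inner optimization loop, fits in a few lines of JAX. A toy sketch (meta-gradient of the learning rate, not the blog's code):

        import jax
        import jax.numpy as jnp

        def inner_sgd(meta_lr, w0, x, y, steps=5):
            """Run a few SGD steps on a quadratic loss; the loop is differentiable."""
            loss = lambda w: jnp.mean((x @ w - y) ** 2)
            w = w0
            for _ in range(steps):
                w = w - meta_lr * jax.grad(loss)(w)
            return loss(w)  # final loss after the inner loop

        # Meta-gradient: how the *learning rate* affects the post-training loss.
        x = jnp.ones((8, 3)); y = jnp.zeros(8); w0 = jnp.ones(3)
        meta_grad = jax.grad(inner_sgd)(0.1, w0, x, y)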
    [P] Introducing FlowMeter for network packet analysis
    We’ve released a new open source project - https://github.com/deepfence/FlowMeter - to analyze and classify packet captures using ML techniques. FlowMeter is an experimental project; we’re using it to evaluate how effectively we can train an ML model to discriminate between different types of traffic flows, e.g. normal and anomalous. You can use sample data from various sources (see the README), or gather packet captures using PacketStreamer https://github.com/deepfence/PacketStreamer or other pcap tools. More information in the README, here: https://github.com/deepfence/FlowMeter and this blogpost: https://medium.com/@siddharthsatpathy.ss/introducing-flowmeter-97e0507862b6 Hope some people find it useful; we’d welcome any feedback, thank you. submitted by /u/sidd_ss [link] [comments]  ( 1 min )
    [P] Open-source Python library for making machine learning demos that run in the browser or inside a Jupyter notebook/Google Colab; package available on PyPI
    https://gradio.app/ is for demoing machine learning models. Prerequisite: Python 3.7+, and that's it! Quick start: to get Gradio running with a simple "Hello, World" example, follow these three steps. First, install Gradio from pip:

        pip install gradio

    Second, run the code below as a Python script or in a Python notebook (or in a Colab notebook):

        import gradio as gr

        def greet(name):
            return "Hello " + name + "!!"

        demo = gr.Interface(fn=greet, inputs="text", outputs="text")

        if __name__ == "__main__":
            demo.launch()

    Third, the interface will appear automatically within the Python notebook, or open in a browser at http://localhost:7860 if running from a script. See more in the getting started guide: https://gradio.app/getting_started/ submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [P] Hot Reloading for Pandas
    Hi guys, I thought you might find my project useful. It's called Reloadium, and it saves a lot of time during Python development, especially in the data science and machine learning fields. More details here: https://github.com/reloadware/reloadium Features include hot reloading of dataframes and modifying and fixing code during debugging. What do you think about it? submitted by /u/kwazar90 [link] [comments]  ( 1 min )
    [D] SHAP method to get feature importance: is linearity realistic?
    The SHAP method assumes a linear relationship between the feature effects (see the definition of "additive feature attribution methods" in the paper). But is this assumption realistic? submitted by /u/savoga [link] [comments]  ( 1 min )
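    For reference, the additivity applies to the explanation model g rather than to the original model f: over simplified binary inputs z', additive feature attribution methods take the form

        g(z') = phi_0 + sum_{i=1..M} phi_i * z'_i

    so f itself can be arbitrarily non-linear; the linear form is only a local approximation around the instance being explained.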
    [D] Need to find a good self-hosted medical image annotation tool.
    Hi. I'm trying to come up with a solution for a medical image annotation system for our laboratory. We need it to be self-hosted (in order to work only on the university's internal network) and open source (since funds are limited). So far I've only found https://lab.vindr.ai/dashboard/projects, but the documentation is really bad and I could not launch it using Docker Compose. I've also found MONAILabel (https://github.com/Project-MONAI/MONAILabel), but it apparently requires a GPU, which makes it really expensive. I'd rather find a CPU-based solution because our task is not that complex. We only get some DICOM files (each containing studies) and want to label them. submitted by /u/feryet [link] [comments]  ( 1 min )
    Machine Learning in Fantasy Premier League? [D]
    https://medium.com/@subin.sen7/our-client-is-the-worlds-best-fpl-player-this-is-not-clickbait-51481fae76c9 Is it possible to use models to beat the market in Fantasy Football? submitted by /u/AI_FPL [link] [comments]
    [P] XGboost, sklearn and others running over encrypted data
    Hello everyone! Following our earlier post on numpy in FHE, we are releasing a new library that allows popular machine learning frameworks to run over encrypted data: https://github.com/zama-ai/concrete-ml Currently this supports XGBoost and many scikit-learn models. We also support PyTorch to some extent. We are trying to closely follow the scikit-learn API (when relevant) to make it easy to use for machine learning practitioners. Happy to hear any feedback on this! submitted by /u/strojax [link] [comments]  ( 1 min )
    [D] Is Tensorflow.js still much slower than Tensorflow
    The last time I used TensorFlow was 3 years ago, and it was a resource hog as well as very slow. I intend to get back into machine learning and am wondering if I should give TensorFlow.js a go again using the Node.js backend, since this will be an Electron app. Or should I just go straight for TensorFlow? Ideally I will need about 15 fps while taking up minimal resources. submitted by /u/manrayboy [link] [comments]  ( 1 min )
    [D] Collaboration-first machine learning platform that enables you to build, train, track, and share your ML projects simply with a few lines of code
    Hey r/MachineLearning, I'm Derrick from Layer (layer.ai) - the collaboration-first machine learning platform that enables you to build, train, track, and share your ML projects simply with a few lines of code. We are soft-launching today! I’ve been working on Layer for the past 2 years with an awesome team around the world. We really poured our hearts and minds into Layer and hope you will like it. Your feedback would be very appreciated! Layer Demo To get started, you can simply run our Quickstart Example! How is Layer different from other tools? Although there are plenty of ML and DS tooling products, we believe that there is still a large gap around collaboration. Many data science projects are hosted on GitHub, which, in our experience, does not provide sufficient depth and abs…  ( 4 min )
  • Open

    How to Tune Graph Neural Networks
    submitted by /u/aidev2040 [link] [comments]
  • Open

    A one-up on motion capture
    A new neural network approach captures the characteristics of a physical system’s dynamic motion from video, regardless of rendering configuration or image differences.  ( 7 min )
    Engineers use artificial intelligence to capture the complexity of breaking waves
    Their model’s predictions should help researchers improve ocean climate simulations and hone the design of offshore structures.  ( 6 min )
  • Open

    Extracting Skill-Centric State Abstractions from Value Functions
    Posted by Dhruv Shah, Intern, and Brian Ichter, Research Scientist, Robotics at Google Advances in reinforcement learning (RL) for robotics have enabled robotic agents to perform increasingly complex tasks in challenging environments. Recent results show that robots can learn to fold clothes, dexterously manipulate a Rubik's Cube, sort objects by color, navigate complex environments and walk on difficult, uneven terrain. But "short-horizon" tasks such as these, which require very little long-term planning and provide immediate failure feedback, are relatively easy to train compared to many tasks that may confront a robot in a real-world setting. Unfortunately, scaling such short-horizon skills to the abstract, long horizons of real-world tasks is difficult. For example, how would one trai…  ( 7 min )
  • Open

    Techniques to Write Better Python Code
    We write a program to solve a problem or to make a tool with which we can repeatedly solve similar problems. For the latter, it is inevitable that we come back to revisit the program we wrote, or that someone else will reuse it. There is also a chance that we will encounter data […] The post Techniques to Write Better Python Code appeared first on Machine Learning Mastery.  ( 15 min )
  • Open

    How to get AI to confuse a shark with a clam
    "The Megalodon was a large bivalve, measuring up to 2.5 meters in length. Its shell was covered in spines, and it had a large, powerful jaw for crushing prey." Although the megalodon is the most widely known as a giant prehistoric shark, I recently learned that Megalodon  ( 4 min )
    Bonus: GPT-3 answers my questions about sawflies, badly
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    ‘I Doubt, Therefore I Am,’ Said AI
    It’s Monday morning, and Paul opens one of several emails sent by his boss, Heather. Her email seems a bit unusual, asking Paul to rush to… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 4 min )
  • Open

    Designing Societally Beneficial Reinforcement Learning Systems
    Deep reinforcement learning (DRL) is transitioning from a research field focused on game playing to a technology with real-world applications. Notable examples include DeepMind’s work on controlling a nuclear reactor or on improving YouTube video compression, or Tesla attempting to use a method inspired by MuZero for autonomous vehicle behavior planning. But the exciting potential for real-world applications of RL should also come with a healthy dose of caution: for example, RL policies are well known to be vulnerable to exploitation, and methods for safe and robust policy development are an active area of research. At the same time as the emergence of powerful RL systems in the real world, the public and researchers are expressing an increased appetite for fair, aligned, and safe machine …  ( 8 min )

  • Open

    Decision Transformers with Hugging Face
    submitted by /u/hellopaperspace [link] [comments]
    How do you get a global observation?
    Naive question: how do you pass a global observation of the environment to your actor and critic? In other words, is a global observation always available or does it depend on the environment? Can you give me a couple of examples? Thanks :) submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
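    One common pattern (centralized training with decentralized execution): build the "global" observation yourself by concatenating every agent's local observation and feed that to the critic only. A sketch with assumed dimensions:

        import torch
        import torch.nn as nn

        # Centralized-critic sketch: each actor sees only its local observation,
        # while the critic is fed the concatenation of *all* agents' observations,
        # one common way to assemble a global view when the environment does not
        # provide one directly.
        n_agents, obs_dim, act_dim = 3, 8, 2

        actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        critic = nn.Sequential(nn.Linear(n_agents * obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

        local_obs = [torch.randn(obs_dim) for _ in range(n_agents)]
        global_obs = torch.cat(local_obs)        # assembled global view
        value = critic(global_obs)               # centralized value estimate
        actions = [actor(o) for o in local_obs]  # decentralized actions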
    Why do Monte Carlo methods have low bias compared to TD methods in RL? Does the bias term in RL have the same meaning as in general ML terminology?
    submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
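    For context, comparing the two learning targets makes the bias difference concrete (standard definitions; "bias" here carries the usual estimator meaning from ML):

        G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...   (Monte Carlo target: an unbiased sample of v_pi(s_t))
        R_{t+1} + gamma*V(S_{t+1})                              (TD(0) target: biased whenever the bootstrap estimate V is off)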
    2nd Neural MMO challenge is out! Design your policy to master PvE and PvP challenge.
    submitted by /u/xiaolongzhu [link] [comments]  ( 1 min )
    Parallel environments affect Q learning performance
    Parallel environments can make learning much more efficient, but sometimes using parallel environments seems to hurt performance. Does anyone know why this happens and how to solve it? submitted by /u/Traditional-Ad4492 [link] [comments]  ( 1 min )
    Papers for Green Vehicle Routing Problem
    My current work requires me to work with the Green Vehicle Routing Problem (GVRP). The specific problem is for an EV (the agent) to find the optimal route to deliver items based on charge consumption, without fully discharging. I'm currently modelling it as a graph data structure, although it could be different. Are there any papers (both RL and exact-method based) on the state of the art that I can look into? I'm also interested in earlier papers that can help me build understanding, so I can increase complexity slowly. Thanks for the help! submitted by /u/evilBotman [link] [comments]  ( 1 min )
    What is the current SOTA for single-threaded continuous-action control using RL?
    As above. I am interested in RL for robotics, specifically for legged locomotion, and I wish to explore RL training on the real robot. Sample efficiency is paramount. Has any progress been made by utilizing, say, RNNs/LSTMs or even attention? submitted by /u/pakodanomics [link] [comments]  ( 1 min )
    Is it possible/useful to allow the agent to influence the observations in later timesteps?
    In my current environment, the agent is fed observations from a particular mathematical computation based on the underlying state. This computation has hyper-parameters that influence the resulting observation given to the agent. Does it make any sense to expose these hyper-parameters to the agent within the action space? That way the agent would be able to adjust its future observations, and perhaps it could use this in a smart way. On the other hand, I've heard that changing the environment from under the agent's feet during training can lead to major issues. One can imagine that learning a dynamic environment is harder than a static one. Any thoughts? submitted by /u/C_BearHill [link] [comments]  ( 1 min )
    treequeues: transfer JAX pytrees between processes at very high speed!
    Hello! If you are using JAX and you need to pass pytrees between processes, I may have something for you :) I developed a "treequeue": a queue made for pytrees of nested arrays. The transfer speed is up to 10 times higher than regular queues. This is done by utilizing shared-memory arrays and avoiding pickling data. This can be very useful when developing distributed architectures, e.g. distributed reinforcement learning, where speed is of the utmost importance. In my case this implementation was very useful for removing bottlenecks when implementing RL PBT algorithms! https://github.com/thomashirtz/treequeues Cheers! submitted by /u/krenast [link] [comments]  ( 1 min )
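    The underlying technique, independent of this particular library, is to move array bytes through shared memory and send only small metadata through a regular queue; a generic sketch (not the treequeues API):

        from multiprocessing import shared_memory
        import numpy as np

        # Put array bytes in shared memory so another process can map them
        # without pickling; only the (name, shape, dtype) metadata travels
        # through an ordinary queue.
        def put(arr):
            shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
            np.ndarray(arr.shape, arr.dtype, buffer=shm.buf)[...] = arr  # copy once
            return shm.name, arr.shape, arr.dtype

        def get(name, shape, dtype):
            shm = shared_memory.SharedMemory(name=name)
            out = np.ndarray(shape, dtype, buffer=shm.buf).copy()
            shm.close()  # the producer is responsible for unlinking
            return out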
    Could you recommend a paper on self-supervised RL related to catastrophic forgetting?
    Hi, I have a problem where the agent forgets good states during self-supervised pre-training. It shows good exploration, but as time goes by only edge cases are explored, leading to poor performance at fine-tuning. I found a paper about this before, but it is really hard to find again; it was related to reward. Could you recommend a paper on this? Thanks for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
  • Open

    [D] Any independent researchers ever get published/into conferences?
    For personal context: I worked overtime during my bachelor's and took grad classes/started research, but my financial situation took a turn for the worse and, well, let's say I don't have enough money for a Master's or to continue my studies. I'm job hunting now, but it's a chicken-and-egg problem: most of these jobs want, at the very least, research experience (which I know how to conduct), while they are the ones with the resources for it. Either way, I have a couple of research topics I want to explore, but limited resources, and I want to be realistic here. Main question: do you know of anyone who has done it in a non-grad/non-institutional way, without the help of established figures in the field? I only know of those that have done it in the ways I describe above (either under an established researcher's team, a university, or a company). Add-on: I would appreciate any personal experiences as well, and if anyone has experience with conferences, how long it takes, etc. My research experience has been largely under NDAs, so I haven't experienced formal publishing yet (but want to!) submitted by /u/robml [link] [comments]  ( 2 min )
    [D] feature embeddings extraction for image classification
    Is there any rule of thumb for deciding from which layer to extract feature embeddings for classification tasks based on kNN? Does it depend on the architecture (ConvNets vs. ViT models, etc.)? Should I extract before or after the activation function? Edit: which model do you suggest starting from? I'm currently using a ResNet50 trained with DINO. submitted by /u/Rich_Freedom98 [link] [comments]  ( 1 min )
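    For ResNet-style models, a common default is the post-pooling, pre-classifier activation; a sketch of extracting it by swapping the final fully connected layer for the identity (ImageNet weights as a stand-in; swap in the DINO checkpoint as needed):

        import torch
        import torchvision.models as models

        # Grab the penultimate (post-pooling, pre-classifier) activations of a
        # ResNet-50, the usual layer for kNN classification.
        model = models.resnet50(pretrained=True)  # stand-in for DINO weights
        model.fc = torch.nn.Identity()            # output is now the 2048-d embedding
        model.eval()

        with torch.no_grad():
            x = torch.randn(4, 3, 224, 224)       # a dummy batch of images
            embeddings = model(x)                 # shape: (4, 2048)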
    [P] Terraform Provider Iterative (TPI) - plugin for ML/AI workloads to spot instance recovery & auto-termination on AWS, GCP, Azure, Kubernetes
    Terraform Provider Iterative (TPI) addresses the specific needs of machine learning teams. It is an open-source tool extending the functionality of Terraform, the world's most widely used multi-cloud provisioning product, and it enables full lifecycle management of computing resources, designed specifically for machine learning pipelines: Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes. The tool aims to bridge the gap between DevOps and data science teams: it builds on top of Terraform, a tool universally familiar to DevOps teams, and extends it to suit machine learning needs. It provides the following advantages for your ML workflow: Lower cost: use your preferred cloud provider's existing pricing, including on-demand per-second billing and bulk discounts. Auto-recovery: spot/preemptible instances are cheap but unreliable; TPI reliably and automatically respawns interrupted instances, caching and restoring the working directory in the cloud even when you are offline. Custom spec: full control over hardware and software requirements via a single config file. submitted by /u/cmstrump [link] [comments]  ( 1 min )
    [R] Flamingo: a Visual Language Model for Few-Shot Learning (from DeepMind)
    Paper (pdf). A link to the paper is also in this blog post. Abstract: Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data. submitted by /u/Wiskkey [link] [comments]  ( 1 min )
    [D] Self-Organizing Maps and Principal Component Analysis.
    Hello all, I am doing my PhD on numerical modelling of solar radiation and I am researching SOMs as a possible tool. I have this problem where I can't decide on a network size, so I have been trying to develop some means to compare SOM neurons from different-sized networks. I have read that the 2-D SOM array tends to span the two highest-variance principal components. So would it be adequate to tag my SOM nodes with the first two projections of PCA, so that I can compare results from different experiments? submitted by /u/juliancanellas [link] [comments]  ( 1 min )
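    Tagging nodes that way can be sketched directly: fit PCA on the data, then project each node's codebook vector onto the first two components (MiniSom is used here as an assumed SOM implementation; any codebook array works):

        import numpy as np
        from sklearn.decomposition import PCA
        from minisom import MiniSom  # assumed SOM library

        # Project each SOM node's codebook vector onto the data's first two
        # principal components, giving comparable 2-D tags across network sizes.
        X = np.random.rand(500, 8)                 # stand-in for radiation features
        pca = PCA(n_components=2).fit(X)

        som = MiniSom(10, 10, X.shape[1], random_seed=0)
        som.train_random(X, 1000)

        codebook = som.get_weights().reshape(-1, X.shape[1])  # (100, 8) node vectors
        node_tags = pca.transform(codebook)                   # (100, 2) PC coordinates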
    "[Discussion]", Same MAE Result
    Hello guys, I'm wondering if you can help: we use machine learning models to identify trading opportunities using historical trading data of price and market indicators. The data we use is fine: abundant, accurate, and (manually) proven to work over several non-sequential months in the past. The problem is that we keep getting the exact same MAE result. Any suggestions on solving this issue would be immensely appreciated. submitted by /u/AFAC1410 [link] [comments]  ( 1 min )
    [P] New blog post: how to automatically find label errors in your audio datasets!
    Hi folks, our blog post on how to automatically find label errors in audio datasets has just gone live. We cover the steps to: ⛏️ Perform feature extraction (aka embeddings) on the Spoken Digit dataset with a pre-trained PyTorch model. 🔢 Use cross-validation to generate out-of-sample predicted probabilities for every example in the dataset. 🏷️ Run one line of cleanlab code on these predicted probabilities to identify which audio clips may be mislabeled. 📰 Blog Post + Google Colab: https://cleanlab.ai/blog/label-errors-audio-datasets/ submitted by /u/weijinglok [link] [comments]  ( 1 min )
    Shazam App for your brain? [R]
    https://arxiv.org/abs/2202.03265?context=eess A computer vision approach to processing naturalistic brain responses to music. The models achieved state of the art, and they used publicly available datasets. With 1 second of brain signal they can classify the name of the song you're listening to, and how much you enjoyed it, at ~89% accuracy. Imagine this on your AirPods; seems wild. submitted by /u/blackliquerish [link] [comments]
    [D] What roles were you able to receive as a PhD with no top tier research?
    It’s looking more and more like I won’t be able to get any top tier publications in my PhD (NeurIPS, ICLR, etc.). I’ve tried, but various factors have limited the experience. I’m curious what people were able to do in this circumstance. I’ll have 3-5 publications in mid-tier venues, plus many years of experience as a software engineer and machine learning engineer (in non-tech companies). People in my program have gone on to work at Amazon, and a few Google, but these were all applied scientist and engineering roles. Does anyone go on to work as a research scientist? National labs seem like a good bet, I’m curious what others have done. submitted by /u/walterkronkite33 [link] [comments]  ( 5 min )
    [D] Have we stopped researching agents?
    It seems to me that the ML research community has stopped talking about agents, as in machines that act in complex environments by themselves. Ever since GPT-3 came out, basically every single paper has been related to NLP or Transformers in general. It seems a natural next step would be to try to get agents we could tell what to do and how to do it in natural language, no? Yet that has not been the focus at all. The last time we had a widespread focus on planning and acting was the StarCraft challenge, but that was "solved" so quickly that nothing much came of it. Research I know about: - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents - Tesla in general, I guess, with their cars and maybe robots. Is it just me, or is there just no interest in researching more capable agents right now? submitted by /u/ReasonablyBadass [link] [comments]  ( 2 min )
    [D] Advertisement Allocation models being used?
    There are a bunch of papers on different advertisement allocation techniques (placing ads into valid ad spots), and in algorithms in grad school I learned MaxFlow/MinCut, but I'm wondering what is actually being used in production systems? Linear models, genetic algorithms, reinforcement learning, some neural network? I know that very little of this is published or talked about due to being proprietary. However, I'm wondering because so much gets published, yet the more advanced algorithms tend to be complex and not necessarily deployable in production systems (or so I suspect). submitted by /u/Flipper3 [link] [comments]  ( 2 min )
    [D] ML model dev pipelines
    Hi Folks! It'd really help if you can participate in this poll and share about your ML/DL model workflow before moving it to production. Out of these steps, which one is your most preferred way of integrating ML into your app or business (not specific to any domain): View Poll submitted by /u/fgp121 [link] [comments]  ( 1 min )
    [D] What is the recommended way to estimate norm for gradient clipping?
    I have read several blogs specifying that you should clip your gradients to the largest value that doesn't cause exploding gradients. But that also means you could end up in a situation where you don't need gradient clipping at all. Could that hurt convergence speed in some way? Is there any guideline for how to pick this hyperparameter based on learning rate, batch size, model size, input size (sequence length, image size), etc.? Or is this project/use-case dependent, making it "just another hyperparameter"? submitted by /u/Icy_Fisherman7187 [link] [comments]  ( 1 min )
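    One practical heuristic is to measure before you clip: log the unclipped global gradient norm for a while, then set the threshold near a high percentile so clipping fires only on outlier batches. A sketch (loader, model, loss_fn, and optimizer are assumed to be defined):

        import torch

        # With max_norm=inf, clip_grad_norm_ never rescales; it just returns the
        # global norm, so this loop only *measures* gradient norms.
        norms = []
        for step, (x, y) in enumerate(loader):
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf"))
            norms.append(total_norm.item())
            optimizer.step()
            if step == 500:
                break

        # Set the clip value near the 90th percentile of observed norms.
        clip_value = torch.quantile(torch.tensor(norms), 0.9).item()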
    [P] HuSpaCy: Industrial-strength Hungarian NLP
    I'd like to show off a Hungarian NLP pipeline which we've been heavily improving over the past year. https://github.com/huspacy/huspacy While processing Hungarian texts might not be interesting for most of you, I believe this project can be a good learning resource, as our models utilize the latest NLP technologies from Explosion and are fully reproducible: we've created a transformers-based model on top of a language-specific BERT model; multi-task and transfer learning is heavily used across the models; we incorporated edit-tree lemmatization and biaffine parsing from spacy-experimental; and we provide word embeddings using the (fastText-like) floret tool. submitted by /u/oroszgy [link] [comments]  ( 1 min )
    [D] How do you decide the ranges for hyperparameters when doing a grid search or a random search?
    I'm trying to tune the parameters of two models: a random forest classifier and a gradient boosting classifier. When using a grid search or a random search (I might also play around with a genetic algorithm), what are appropriate ranges to use, and how can I come to know them? I obviously can't just specify an arbitrary range, because that might decrease the effectiveness of the random search and make the grid search too long. Do the ranges depend on the number of features in the dataset or other characteristics (e.g. our dataset is quite imbalanced and we're using SMOTE resampling)? I'm just trying to tune n_estimators and max_depth, but RandomForestClassifier also has many other parameters. Do I just experiment with all of them, or are there known parameters that don't do anything useful unless I'm chasing an extra 0.1% of accuracy or recall? Basically the same questions for the GradientBoostingClassifier. Letting go of specific models for a second, I'd appreciate generic pointers on how to deliberately tune hyperparameters or specify ranges instead of arbitrarily specifying something and hoping it works. Thanks! submitted by /u/stuffingmybrain [link] [comments]  ( 2 min )
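    As a starting point, commonly used random-search ranges for a random forest look like the sketch below (the ranges are rules of thumb, not dataset-specific advice); class_weight="balanced" and a recall scorer are one way to account for the imbalance alongside SMOTE:

        from scipy.stats import randint
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import RandomizedSearchCV

        param_dist = {
            "n_estimators": randint(100, 1000),   # more trees rarely hurt, just slower
            "max_depth": randint(3, 30),          # deep trees overfit small datasets
            "min_samples_leaf": randint(1, 20),   # larger leaves regularize
            "max_features": ["sqrt", "log2", None],
        }
        search = RandomizedSearchCV(
            RandomForestClassifier(class_weight="balanced", random_state=0),
            param_dist, n_iter=50, cv=5, scoring="recall", n_jobs=-1, random_state=0,
        )
        # search.fit(X_train, y_train)  # X_train/y_train assumed from your pipeline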
    How to do meaningful work as an independent researcher? [Discussion]
    With big players like OpenAI and Google building these massive models, how do independent researchers without access to such scale and compute do meaningful work? I came across tweets from researchers, especially ones working on generative models, saying they feel their work looks irrelevant after seeing results from DALL-E 2. It feels like just a couple of years ago, if you had a decent GPU setup, you could pretty much do world-class research. It doesn't look like it anymore. Are there any research directions that make it a level playing field, where compute and scale are not necessarily the solution, or are we all doomed to be prompt engineers for GPT models? submitted by /u/HairyIndianDude [link] [comments]  ( 4 min )
  • Open

    How long before DALL-E 2 (or something similarly capable) produces porn?
    DALL-E 2 was released a few weeks ago and is exceptionally capable at generating images from a text description. However, training data containing porn has been filtered out, as are requests containing NSFW terms, so it is not suited for generating porn for now. So how long would it take before: someone builds something as capable that allows porn, or OpenAI opens DALL-E 2 to such content? (Edit) This is an open discussion. But there is clear evidence that the technology now exists to create, on demand and instantly, any kind of porn with any kind of kink in a lot of styles, including photorealistic ones. submitted by /u/exotic_deviantcy [link] [comments]
    Augmented Military Soldiers
    We often believe that future soldiers are going to be robots, and that there will no longer be human "boots on the ground" thanks to artificial intelligence, drones, etc. Many of us fail to realize that current soldiers, in militaries across the globe, are being augmented with AI as we speak. Things like AR headsets, computerized sights, and other technologies are going to turn them into cyborgs... Another intriguing aspect is that major companies like Microsoft are creating these technologies for the military. How long until we see augmented soldiers going up against one another? https://www.linkedin.com/pulse/augmented-soldier-mvyl-associates-1f/?trackingId=FnvaWMtxMsfm2fJeQP1bgg%3D%3D submitted by /u/IsabeldeMontoya [link] [comments]  ( 1 min )
    A Parable Of Explainability
    submitted by /u/elcric_krej [link] [comments]
    AI Dream 18 - Visual Trip through Wonderland
    submitted by /u/LordPewPew777 [link] [comments]
    Community College AI Program Needs
    I am the instructor and lead of the AI program at a community college in North Carolina. I am developing the program currently and have money to spend on Cool Educational Robotics and potentially other tools. Does anybody have any recommendations for robots to teach python with classic search, reinforcement learning methods, or other university level algorithms? A robotic arm would be nice, but I need to tie it directly to a potential project in machine learning, deep learning, reinforcement learning, search, logic, computer vision, etc. We have a NAO robot and are getting a LIMO: https://www.robotlab.com/store/limo-agilex But we could probably use an Educational Robotic Arm Thanks! Our program is one of the first Associates programs in AI in the country. Check out our program website. It is completely available online and we accept out of state students. https://www.waynecc.edu/programs/ai/ Feel free to ask any questions as well. *Also, advice on best value cloud computing virtualized resources for high performance computing for deep learning projects is appreciated. We are currently planning on going with Microsoft's Data Science Virtual Machines but not sure which series exactly submitted by /u/Wayne_CC_AI [link] [comments]  ( 1 min )
    PeopleLens AI Helps The Blind | Brain Fingerprints Detect Autism | AI Predicts Cancer Tumor Regrowth
    submitted by /u/getrich_or_diemining [link] [comments]
    Uhhhh
    submitted by /u/Blazeolmo [link] [comments]
    Dall-E 2 access
    Hello everyone, I wanted to know if one could get access to the DALL-E 2 app without having to hope to be chosen from the waitlist. Thanks! submitted by /u/Swaggyswaggerson [link] [comments]  ( 1 min )
    A brief history of deepfakes - how it started and where it might take us
    submitted by /u/much_successes [link] [comments]
    Artificial Intelligence of Things (AIoT) - Latest Technology, Future Upcoming Trends and Forecast to 2028
    Our latest research report on the Artificial Intelligence of Things (AIoT) market shows how that market is changing and how trends in demographics, business cycles, and microeconomics affect the AIoT market as a whole. Our study of the global AIoT market demonstrates the state of the business by looking at production value and key regions. The report provides a full analysis of sales volume, pricing, revenue, profit margin, and growth rate within the AIoT market. Get a free sample report @ https://www.intelligencemarketreport.com/report-sample/573982 Key information extracted from the report: extensive information on factors estimated to affect market growth and market share during the forecast period; the present scenario and future growth prospects of the market in various geographical regions; a competitive landscape analysis with both qualitative and quantitative information; a SWOT analysis along with Porter's Five Forces analysis. Leading key players included in this report are: Twilio Inc., ShiftPixy Inc., Micron Technology, Intel, IBM, Gopher Protocol, Deep Vision, Ceva, ALCES, AISPEECH. submitted by /u/Purva_Duggal [link] [comments]
    Free Webinar on Automated CV pipelines | Video Predictions
    The fourth part of the Automated CV Pipelines webinar series is open for registration. It will cover some of the best practices for video-specific annotation tasks. If you are interested, you can check out the details here! submitted by /u/WeekendClassic [link] [comments]
    Stairway to (A.I. animation + sound design)
    submitted by /u/nenomancer [link] [comments]  ( 1 min )
    Ancient civs always burn (A.I. animation + some sound design)
    submitted by /u/nenomancer [link] [comments]
    Treasure Planet made with starryai
    submitted by /u/Losthel [link] [comments]
    Elon Musk's NEURALINK vs Bryan Johnson's KERNEL (No Surgery)
    submitted by /u/1024cities [link] [comments]  ( 1 min )
    Sentiment, cognitive distortions and market data time series - causal analysis for crypto markets
    submitted by /u/akolonin [link] [comments]  ( 1 min )
    High Tech Hacks 2022!
    Hey guys! I’m excited to share with you an exciting upcoming hackathon, High Tech Hacks 2.0! High Tech Hacks is a free, international 24-hour hackathon on May 21-22nd, 2022 open to all high schoolers hoping to learn a new coding skill, compete for awesome prizes, or work with other like-minded hackers. Let’s invent, create, and push the boundaries of technology (as much as we can at one hackathon)! What to expect: Last year, participants learned the basics of web development, Python, virtual reality, and how to make a Discord bot from current software engineers at Microsoft, Amazon, Twilio, other tech companies, and Columbia University SHPE. Thanks to our company sponsors, each participant last year received nearly $400 worth of free software and swag. Register to earn FREE swag (t-shirts, water bottles, stickers!) Network with other passionate STEM high school students from around the world! (Last year we had participants from 26 countries signed up already!) This year we have even bigger prizes, competitions, and speakers so stay tuned! Reach out to me with more questions or email [hightechhackathon@gmail.com](mailto:hightechhackathon@gmail.com). Happy hacking! :D Sign up here to confirm your interest and get on our mailing list: Click Here to Register! Also, meet other hackers by Joining our Discord! For more, Check out our Website submitted by /u/HighTechHacks [link] [comments]  ( 1 min )
    MyStyle: The Best AI Face Manipulation to Date!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
  • Open

    Amazon Rekognition introduces Streaming Video Events to provide real-time alerts on live video streams
    Today, AWS announced the general availability of Amazon Rekognition Streaming Video Events, a fully managed service for camera manufacturers and service providers that uses machine learning (ML) to detect objects such as people, pets, and packages in live video streams from connected cameras. Amazon Rekognition Streaming Video Events sends them a notification as soon as […]  ( 8 min )
    3xLOGIC uses Amazon Rekognition Streaming Video Events to provide intelligent video analytics on live video streams to monitoring agents
    3xLOGIC is a leader in commercial electronic security systems. They provide commercial security systems and managed video monitoring for businesses, hospitals, schools, and government agencies. Managed video monitoring is a critical component of a comprehensive security strategy for 3xLOGIC’s customers. With more than 50,000 active cameras in the field, video monitoring teams face a daily […]  ( 4 min )
    Abode uses Amazon Rekognition Streaming Video Events to provide real-time notifications to their smart home customers
    Abode Systems (Abode) offers homeowners a comprehensive suite of do-it-yourself home security solutions that can be set up in minutes and enables homeowners to keep their family and property safe. Since the company’s launch in 2015, in-camera motion detection sensors have played an essential part in Abode’s solution, enabling customers to receive notifications and monitor […]  ( 6 min )
    Pandas user-defined functions are now available in Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes. With Data Wrangler, you can select and query data with just a few clicks, quickly transform data with over 300 built-in data transformations, and understand your data with built-in visualizations without writing any code. Additionally, […]  ( 4 min )
    How Searchmetrics uses Amazon SageMaker to automatically find relevant keywords and make their human analysts 20% faster
    Searchmetrics is a global provider of search data, software, and consulting solutions, helping customers turn search data into unique business insights. To date, Searchmetrics has helped more than 1,000 companies such as McKinsey & Company, Lowe’s, and AXA find an advantage in the hyper-competitive search landscape. In 2021, Searchmetrics turned to AWS to help with […]  ( 5 min )
    Identify paraphrased text with Hugging Face on Amazon SageMaker
    Identifying paraphrased text has business value in many use cases. For example, by identifying sentence paraphrases, a text summarization system could remove redundant information. Another application is to identify plagiarized documents. In this post, we fine-tune a Hugging Face transformer on Amazon SageMaker to identify paraphrased sentence pairs in a few steps. A truly robust […]  ( 10 min )
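    For readers who want to see the shape of such a job, here is a minimal local sketch of the same task (paraphrase detection on GLUE/MRPC with the Hugging Face Trainer; the post itself wraps an equivalent script in SageMaker):

        from datasets import load_dataset
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        # Tokenize sentence pairs from MRPC, a paraphrase-identification dataset.
        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        ds = load_dataset("glue", "mrpc").map(
            lambda e: tok(e["sentence1"], e["sentence2"], truncation=True), batched=True)

        # Binary classifier: paraphrase vs. not paraphrase.
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=2)
        trainer = Trainer(
            model=model,
            args=TrainingArguments("out", per_device_train_batch_size=16,
                                   num_train_epochs=1),
            train_dataset=ds["train"], eval_dataset=ds["validation"], tokenizer=tok)
        # trainer.train()  # fine-tunes the classifier on paraphrase pairs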
    How Moovit turns data into insights to help passengers avoid delays using Apache Airflow and Amazon SageMaker
    This is a guest post by Moovit’s Software and Cloud Architect, Sharon Dahan. Moovit, an Intel company, is a leading Mobility as a Service (MaaS) solutions provider and creator of the top urban mobility app. Moovit serves over 1.3 billion riders in 3,500 cities around the world. We help people everywhere get to their destination […]  ( 7 min )
  • Open

    How DNEG Helped Win Another Visual-Effects Oscar by Bringing ‘Dune’ to Life With NVIDIA RTX
    Featuring stunning visuals from futuristic interstellar worlds, including colossal sand creatures, Dune captivated audiences around the world. The sci-fi film picked up six Oscars last month at the 94th Academy Awards, including for Best Sound and Visual Effects. Adapted from Frank Herbert’s 1965 novel of the same name, Dune tells the story of Paul Atreides, Read article > The post How DNEG Helped Win Another Visual-Effects Oscar by Bringing ‘Dune’ to Life With NVIDIA RTX appeared first on NVIDIA Blog.  ( 4 min )
    Your Odyssey Awaits: Stream ‘Lost Ark’ to Nearly Any Device This GFN Thursday
    It’s a jam-packed GFN Thursday. This week brings the popular, free-to-play, action role-playing game Lost Ark to gamers across nearly all their devices, streaming on GeForce NOW. And that’s not all. GFN Thursday also delivers an upgraded experience in the 2.0.40 update. M1-based MacBooks, iMacs and Mac Minis are now supported natively. Plus, membership gift Read article > The post Your Odyssey Awaits: Stream ‘Lost Ark’ to Nearly Any Device This GFN Thursday appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Top 11 Retail Technology Trends For 2022
    The COVID-19 pandemic became a catalyst for retail businesses and their customers. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 6 min )
  • Open

    How can we reduce the carbon footprint of global computing?
    Workshop hosted by MIT’s Climate and Sustainability Consortium, MIT-IBM Watson AI Lab, and the MIT Schwarzman College of Computing highlights how new approaches to computing can save energy and help the planet.  ( 8 min )
    Aging Brain Initiative awards fund five new ideas to study, fight neurodegeneration
    Competitive seed grants launch yearlong investigations of novel hypotheses about potential causes, biomarkers, treatments of Alzheimer’s and ALS.  ( 6 min )
  • Open

    Help to create a simple Neural Network... or find the bug
    Hello, I really need help creating a simple Neural Network that approximates the Rosenbrock function (a function with two variables). I have used the book "MATLAB Deep Learning" by Phil Kim, which provides code examples. However, when I make a simple example where the network is trained with backpropagation, the output for the testing points simply comes out as ones. It is very early work, so the code is very simple. I have cut out just the piece of code with the Neural Network. Can someone give any tips or help me figure out what is wrong?

        %% Neural Network
        % Rosenbrock function: f = (1 - x1).^2 + 100*(x2 - x1.^2).^2
        % X: 25x2 matrix of sampling points
        % f_samp: 25x1 vector of function values

        % Number of nodes in the hidden layer
        nHNodes = 4;
        % Input layer has two nodes; output layer has one node

        % Initial weights (k is the input dimension, i.e. 2)
        W1 = 2*rand(nHNodes, k) - 1;
        W2 = 2*rand(1, nHNodes) - 1;

        alpha = 0.5;  % Learning rate

        % Backpropagation (outer loop renamed so it no longer shadows i)
        for epoch = 1:10000
            for i = 1:size(X,1)
                x = X(i, :)';
                d = f_samp(i);

                % Forward pass
                v1 = W1*x;
                y1 = fun_sigmoid(v1);
                v  = W2*y1;
                y  = fun_sigmoid(v);   % NOTE: sigmoid bounds the output to (0,1)

                % Backward pass
                e      = d - y;
                delta  = y.*(1-y).*e;
                e1     = W2'*delta;
                delta1 = y1.*(1-y1).*e1;

                dW1 = alpha*delta1*x';
                W1  = W1 + dW1;
                dW2 = alpha*delta*y1';
                W2  = W2 + dW2;
            end
        end

        % Testing sampling points
        for i = 1:size(X,1)
            x  = X(i, :)';
            v1 = W1*x;
            y1 = fun_sigmoid(v1);
            v  = W2*y1;
            y(i) = fun_sigmoid(v);
        end

    submitted by /u/TobiasFred [link] [comments]  ( 1 min )
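    A likely culprit, judging from the code alone: the Rosenbrock function takes values far above 1 on any typical sampling grid, while a sigmoid output is bounded to (0, 1), so training saturates every prediction at 1. Swapping the output activation for a linear unit is the standard fix for regression targets, e.g.:

        % Likely fix (regression targets exceed the (0,1) range of a sigmoid):
        y = v;            % linear output unit instead of fun_sigmoid(v)
        delta = e;        % its derivative is 1, so delta is just the error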
    I need help making a neural network from scratch in python
    Here's my code: https://github.com/heyuhowudoin/mnist_ai I currently have 3 layers, with 15, 15, and 10 neurons respectively. I'm using the MNIST database to get images of hand-drawn digits and trying to classify them by which neuron in the final layer has the highest activation. I use a combination of sigmoid and ReLU as activation functions, and I am using minibatches of 100 pictures each. What I see happening is that the network converges to a point where every neuron in the final layer outputs a 0, because that gives a relatively low error: the target for 9 out of 10 of those neurons is indeed 0, whereas the target for the correct output neuron is 1. I've tried looking over my backprop algorithm, I've tried changing the neurons in each hidden layer to 200, and I've changed the learning speed a bunch, but I can't seem to figure out why it's not working, so if one of you guys wouldn't mind helping me, that would be much appreciated. submitted by /u/-i-hate-this-place- [link] [comments]  ( 1 min )
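    One commonly suggested fix for this exact failure mode is a softmax output with cross-entropy loss, which forces the ten outputs to compete so "all zeros" stops being a low-error minimum; a sketch of the two pieces:

        import numpy as np

        # Softmax output + cross-entropy loss: the 10 outputs must sum to 1,
        # so predicting all zeros is no longer possible, unlike with 10
        # independent sigmoid/MSE outputs.
        def softmax(z):
            z = z - z.max(axis=1, keepdims=True)   # numerical stability
            e = np.exp(z)
            return e / e.sum(axis=1, keepdims=True)

        def cross_entropy_grad(probs, onehot_targets):
            # gradient of mean cross-entropy w.r.t. the pre-softmax logits
            return (probs - onehot_targets) / probs.shape[0]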
    AlphaGo Zero
    Hello, I have an AlphaGo Zero question. Why doesn't AlphaGo Zero use Q(s,a) to choose its next move in the Monte Carlo tree search? Why does it use π instead? submitted by /u/Skinnybisquit [link] [comments]
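    For reference, the move actually played is drawn from the visit-count distribution, π(a|s) ∝ N(s,a)^(1/τ): visit counts aggregate information from many simulations and are therefore more robust than raw Q(s,a) estimates, which can be high-variance for rarely visited actions.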
  • Open

    Take Your Machine Learning Skills Global
    Sponsored Post. In our interconnected world, a decision made thousands of miles away can have lasting consequences for entire organizations or economies. When small changes have big effects, it is unsurprising that companies and governments are turning to machine learning and AI to accurately predict risk. How the Global Community is Applying Machine Learning […] The post Take Your Machine Learning Skills Global appeared first on Machine Learning Mastery.  ( 2 min )
  • Open

    Fast Aquatic Swimmer Optimization with Differentiable Projective Dynamics and Neural Network Hydrodynamic Models. (arXiv:2204.12584v1 [cs.RO])
    Aquatic locomotion is a classic fluid-structure interaction (FSI) problem of interest to biologists and engineers. Solving the fully coupled FSI equations for incompressible Navier-Stokes and finite elasticity is computationally expensive. Optimizing robotic swimmer design within such a system generally involves cumbersome, gradient-free procedures on top of the already costly simulation. To address this challenge we present a novel, fully differentiable hybrid approach to FSI that combines a 2D direct numerical simulation for the deformable solid structure of the swimmer and a physics-constrained neural network surrogate to capture hydrodynamic effects of the fluid. For the deformable simulation of the swimmer's body, we use state-of-the-art techniques from the field of computer graphics to speed up the finite-element method (FEM). For the fluid simulation, we use a U-Net architecture trained with a physics-based loss function to predict the flow field at each time step. The pressure and velocity field outputs from the neural network are sampled around the boundary of our swimmer using an immersed boundary method (IBM) to compute its swimming motion accurately and efficiently. We demonstrate the computational efficiency and differentiability of our hybrid simulator on a 2D carangiform swimmer. Since both the solid simulator and the hydrodynamics model are automatically differentiable, we obtain a fully differentiable FSI simulator that can be used for computational co-design of geometry and controls for rigid and soft bodies immersed in fluids, such as minimizing drag, maximizing speed, or maximizing efficiency via direct gradient-based optimization.  ( 2 min )
    Zero-Touch Network on Industrial IoT: An End-to-End Machine Learning Approach. (arXiv:2204.12605v1 [cs.LG])
    Industry 4.0-enabled smart factory is expected to realize the next revolution for manufacturers. Although artificial intelligence (AI) technologies have improved productivity, current use cases belong to small-scale and single-task operations. To unbound the potential of smart factory, this paper develops zero-touch network systems for intelligent manufacturing and facilitates distributed AI applications in both training and inferring stages in a large-scale manner. The open radio access network (O-RAN) architecture is first introduced for the zero-touch platform to enable globally controlling communications and computation infrastructure capability in the field. The designed serverless framework allows intelligent and efficient learning assignments and resource allocations. Hence, requested learning tasks can be assigned to appropriate robots, and the underlying infrastructure can be used to support the learning tasks without expert knowledge. Moreover, due to the proposed network system's flexibility, powerful AI-enabled networking algorithms can be utilized to ensure service-level agreements and superior performances for factory workloads. Finally, three open research directions of backward compatibility, end-to-end enhancements, and cybersecurity are discussed for zero-touch smart factory.  ( 2 min )
    Data-driven detector signal characterization with constrained bottleneck autoencoders. (arXiv:2203.04604v4 [physics.ins-det] UPDATED)
    A common technique in high energy physics is to characterize the response of a detector by means of models tuned to data, which build parametric maps from the physical parameters of the system to the expected signal of the detector. When the underlying model is unknown it is difficult to apply this method, and often simplifying assumptions are made, introducing modeling errors. In this article, using a waveform toy model, we present how deep learning in the form of constrained bottleneck autoencoders can be used to learn the underlying unknown detector response model directly from data. The results show that excellent performance can be achieved even when the signals are significantly affected by random noise. The trained algorithm can be used simultaneously to estimate the physical parameters of the model, simulate the detector response with high fidelity, and denoise detector signals.  ( 2 min )
    A Survey on Machine Learning Approaches for Modelling Intuitive Physics. (arXiv:2202.06481v2 [cs.LG] UPDATED)
    Research in cognitive science has provided extensive evidence of human cognitive ability in performing physical reasoning of objects from noisy perceptual inputs. Such a cognitive ability is commonly known as intuitive physics. With advancements in deep learning, there is an increasing interest in building intelligent systems that are capable of performing physical reasoning from a given scene for the purpose of building better AI systems. As a result, many contemporary approaches in modelling intuitive physics for machine cognition have been inspired by literature from cognitive science. Despite the wide range of work in physical reasoning for machine cognition, there is a scarcity of reviews that organize and group these deep learning approaches. Especially at the intersection of intuitive physics and artificial intelligence, there is a need to make sense of the diverse range of ideas and approaches. Therefore, this paper presents a comprehensive survey of recent advances and techniques in intuitive physics-inspired deep learning approaches for physical reasoning. The survey will first categorize existing deep learning approaches into three facets of physical reasoning before organizing them into three general technical approaches and propose six categorical tasks of the field. Finally, we highlight the challenges of the current field and present some future research directions.  ( 2 min )
    Generating Examples From CLI Usage: Can Transformers Help?. (arXiv:2204.12648v1 [cs.SE])
    Continuous evolution in modern software often causes documentation, tutorials, and examples to be out of sync with changing interfaces and frameworks. Relying on outdated documentation and examples can lead programs to fail or be less efficient or even less secure. In response, programmers need to regularly turn to other resources on the web such as StackOverflow for examples to guide them in writing software. We recognize that this inconvenient, error-prone, and expensive process can be improved by using machine learning applied to software usage data. In this paper, we present our practical system which uses machine learning on large-scale telemetry data and documentation corpora, generating appropriate and complex examples that can be used to improve documentation. We discuss both feature-based and transformer-based machine learning approaches and demonstrate that our system achieves 100% coverage for the functionalities used in the product, providing up-to-date examples upon every release and reducing the number of PRs submitted by software owners writing and editing documentation by >68%. We also share valuable lessons learnt during the 3 years that our production quality system has been deployed for Azure Cloud Command Line Interface (Azure CLI).  ( 2 min )
    Generating Self-Serendipity Preference in Recommender Systems for Addressing Cold Start Problems. (arXiv:2204.12651v1 [cs.IR])
    Classical accuracy-oriented Recommender Systems (RSs) typically face the cold-start problem and the filter-bubble problem, where users suffer familiar, repeated, and even predictable recommendations, leaving them bored and unsatisfied. To address the above issues, serendipity-oriented RSs are proposed to recommend appealing and valuable items that significantly deviate from users' historical interactions, satisfying them by introducing unexplored but relevant candidate items. In this paper, we devise a novel serendipity-oriented recommender system (Generative Self-Serendipity Recommender System, GS^2-RS) that generates users' self-serendipity preferences to enhance recommendation performance. Specifically, this model extracts users' interest and satisfaction preferences, generates virtual but convincing neighbors' preferences from themselves, and achieves their self-serendipity preference. Then these preferences are injected into the rating matrix as additional information for RS models. Note that GS^2-RS can not only tackle the cold-start problem but also provide diverse yet relevant recommendations to relieve the filter-bubble problem. Extensive experiments on benchmark datasets illustrate that the proposed GS^2-RS model significantly outperforms state-of-the-art baseline approaches in serendipity measures with stable accuracy performance.  ( 2 min )
    AI-Bind: Improving Binding Predictions for Novel Protein Targets and Ligands. (arXiv:2112.13168v4 [q-bio.QM] UPDATED)
    Identifying novel drug-target interactions (DTI) is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We first unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network rather than learning the node features. Then, we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training, allowing us to limit the annotation imbalance and improve binding predictions for novel proteins and ligands. We illustrate the value of AI-Bind by predicting drugs and natural compounds with binding affinity to SARS-CoV-2 viral proteins and the associated human proteins. We also validate these predictions via auto-docking simulations and comparison with recent experimental evidence, and advance the interpretation of machine learning predictions of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. Overall, AI-Bind offers a powerful high-throughput approach to identify drug-target combinations, with the potential to become a powerful tool in drug discovery.  ( 2 min )
    Trusted Multi-View Classification with Dynamic Evidential Fusion. (arXiv:2204.11423v2 [cs.LG] UPDATED)
    Existing multi-view classification algorithms focus on promoting accuracy by exploiting different views, typically integrating them into common representations for follow-up tasks. Although effective, it is also crucial to ensure the reliability of both the multi-view integration and the final decision, especially for noisy, corrupted and out-of-distribution data. Dynamically assessing the trustworthiness of each view for different samples could provide reliable integration. This can be achieved through uncertainty estimation. With this in mind, we propose a novel multi-view classification algorithm, termed trusted multi-view classification (TMC), providing a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The proposed TMC can promote classification reliability by considering evidence from each view. Specifically, we introduce the variational Dirichlet to characterize the distribution of the class probabilities, parameterized with evidence from different views and integrated with the Dempster-Shafer theory. The unified learning framework induces accurate uncertainty and accordingly endows the model with both reliability and robustness against possible noise or corruption. Both theoretical and experimental results validate the effectiveness of the proposed model in accuracy, robustness and trustworthiness.  ( 2 min )
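    The core fusion step described above can be sketched concretely. Below is a minimal NumPy illustration of the reduced Dempster-Shafer rule for combining two views' Dirichlet evidence, following the belief/uncertainty formulation this line of work uses; the function and variable names are ours, and in practice the per-view evidence would come from a non-negative activation on each view's network output.

```python
import numpy as np

def ds_combine(e1, e2):
    """Combine per-view Dirichlet evidence with the reduced Dempster-Shafer rule.

    e1, e2: non-negative evidence vectors of length K (one entry per class).
    Returns combined belief masses and the combined uncertainty mass.
    """
    K = len(e1)
    # Dirichlet parameters alpha = evidence + 1; Dirichlet strength S = sum(alpha)
    S1, S2 = e1.sum() + K, e2.sum() + K
    b1, b2 = e1 / S1, e2 / S2          # per-view belief masses
    u1, u2 = K / S1, K / S2            # per-view uncertainty masses
    # conflict: mass the two views assign to *different* classes
    C = np.outer(b1, b2).sum() - (b1 * b2).sum()
    b = (b1 * b2 + b1 * u2 + b2 * u1) / (1.0 - C)
    u = (u1 * u2) / (1.0 - C)
    return b, u

# toy example: a confident 3-class view fused with an uninformative one
b, u = ds_combine(np.array([9.0, 0.5, 0.5]), np.array([1.0, 1.0, 1.0]))
print(b, u)  # combined beliefs lean toward class 0; uncertainty shrinks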
    An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU Gate. (arXiv:2204.12554v1 [cs.LG])
    A particular direction of recent advances in stochastic deep-learning algorithms has been the uncovering of a rather mysterious heavy-tailed nature of the stationary distribution of these algorithms, even when the data distribution is not heavy-tailed. Moreover, the heavy-tail index is known to show interesting dependence on the input dimension of the net, the mini-batch size and the step size of the algorithm. In this short note, we undertake an experimental study of this index for S.G.D. while training a ReLU gate (in the realizable and in the binary classification setup) and for a variant of S.G.D. that was proven to converge in Karmakar and Mukherjee (2022) for ReLU realizable data. From our experiments we conjecture that these two algorithms have similar heavy-tail behaviour on any data where the latter can be proven to converge. Secondly, we demonstrate that the heavy-tail index of the late-time iterates in this model scenario has strikingly different properties than either what has been proven for linear hypothesis classes or what has been previously demonstrated for large nets.  ( 2 min )
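    The quantity being measured here is a tail index of the iterate distribution. As a hedged illustration, the snippet below shows the classical Hill estimator, one common choice for estimating the index of a power-law tail from samples such as late-time SGD iterates; the note may use a different estimator, so treat this purely as a sketch of the quantity under study.

```python
import numpy as np

def hill_tail_index(samples, k=100):
    """Hill estimator of the tail index alpha for P(|X| > t) ~ t^{-alpha},
    computed from the k largest magnitudes. One standard estimator; the
    note's exact estimator may differ."""
    x = np.sort(np.abs(np.asarray(samples)))[::-1]  # descending magnitudes
    top, pivot = x[:k], x[k]
    return 1.0 / np.mean(np.log(top / pivot))

# sanity check: Student-t with 2 degrees of freedom has tail index ~2
rng = np.random.default_rng(0)
heavy = rng.standard_t(df=2.0, size=50_000)
print(hill_tail_index(heavy, k=500))  # roughly 2
```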
    Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models. (arXiv:2104.05158v6 [cs.DC] UPDATED)
    Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology, and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; and (v) leveraging reduced-precision communications, a multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for robust and efficient end-to-end training in production environments.  ( 3 min )
    Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for Imbalanced Learning. (arXiv:2204.08735v2 [cs.LG] UPDATED)
    Class-imbalanced distributions are widespread in real-world engineering. However, the mainstream optimization algorithms that seek to minimize error will trap a deep learning model in sub-optima when facing extreme class imbalance, seriously harming classification precision, especially on the minority classes. The essential reason is that the gradients of the classifier weights are imbalanced among the components from different classes. In this paper, we propose the Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients. We perform experiments on large-scale classification and segmentation datasets, and our ARB-Loss can achieve state-of-the-art performance via only one-stage training instead of the two-stage learning used in current SOTA works.  ( 2 min )
    Performer: A Novel PPG to ECG Reconstruction Transformer For a Digital Biomarker of Cardiovascular Disease Detection. (arXiv:2204.11795v2 [eess.SP] UPDATED)
    Cardiovascular diseases (CVDs) have become the leading cause of death; three-quarters of these deaths occur in lower-income communities. Electrocardiography (ECG), an electrical measurement capturing cardiac activity, is the gold standard for diagnosing CVDs. However, ECG is infeasible for continuous cardiac monitoring due to its requirement for user participation. Meanwhile, photoplethysmography (PPG) is easy to collect, but its limited accuracy constrains its clinical usage. In this research, a novel Transformer-based architecture, Performer, is proposed to reconstruct ECG from PPG and to create a novel digital biomarker, PPG along with its reconstructed ECG, as multiple modalities for CVD detection. This architecture, for the first time, performs Transformer sequence-to-sequence translation on biomedical waveforms, while also utilizing the advantages of the easily accessible PPG and the well-studied base of ECG. Shifted Patch-based Attention (SPA) is created to maximize the signal features by feeding various sequence lengths as hierarchical stages into the training, while also capturing cross-patch connections through the shifted patch mechanism. This architecture achieves a state-of-the-art performance of 0.29 RMSE for reconstructing ECG from PPG, with an average of 95.9% for CVD diagnosis on the MIMIC III dataset and 75.9% for diabetes on the PPG-BP dataset. Performer, along with its novel digital biomarker, offers a low-cost and non-invasive solution for continuous cardiac monitoring, requiring only the easily extractable PPG data to reconstruct the not-as-accessible ECG data. As a proof of concept, an earring wearable, named PEARL (prototype), is designed to scale up the point-of-care (POC) healthcare system.  ( 2 min )
    An empirical study of the effect of background data size on the stability of SHapley Additive exPlanations (SHAP) for deep learning models. (arXiv:2204.11351v2 [cs.LG] UPDATED)
    Nowadays, the interpretation of why a machine learning (ML) model makes certain inferences is as crucial as the accuracy of such inferences. Some ML models, like decision trees, possess inherent interpretability that can be directly comprehended by humans. Others, like artificial neural networks (ANN), rely on external methods to uncover the deduction mechanism. SHapley Additive exPlanations (SHAP) is one such external method, which requires a background dataset when interpreting ANNs. Generally, a background dataset consists of instances randomly sampled from the training dataset. However, the sampling size and its effect on SHAP remain unexplored. In our empirical study on the MIMIC-III dataset, we show that the two core explanations - SHAP values and variable rankings - fluctuate when using different background datasets acquired from random sampling, indicating that users cannot unquestioningly trust the one-shot interpretation from SHAP. Luckily, such fluctuation decreases with the increase of the background dataset size. Also, we notice a U-shape in the stability assessment of SHAP variable rankings, demonstrating that SHAP is more reliable in ranking the most and least important variables compared to moderately important ones. Overall, our results suggest that users should take into account how background data affects SHAP results, with improved SHAP stability as the background sample size increases.  ( 2 min )
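    The kind of stability check described can be reproduced with the shap library. The sketch below re-draws random background sets of different sizes and compares the resulting variable rankings; the model and data here are hypothetical stand-ins (the study uses ANNs on MIMIC-III), and KernelExplainer is used only because it accepts an explicit background dataset. Expect it to be slow for large backgrounds.

```python
import numpy as np
import shap  # pip install shap
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-ins for the study's model and data.
X = np.random.rand(1000, 8)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
model = GradientBoostingClassifier().fit(X, y)
x_explained = X[:5]

def ranking(background):
    explainer = shap.KernelExplainer(lambda d: model.predict_proba(d)[:, 1],
                                     background)
    sv = explainer.shap_values(x_explained)
    return np.argsort(-np.abs(sv).mean(axis=0))  # variable ranking

# Re-draw random backgrounds of a fixed size and watch the ranking fluctuate;
# repeating for growing sizes reproduces the stabilisation effect.
for size in (10, 100):
    ranks = [ranking(X[np.random.choice(len(X), size, replace=False)])
             for _ in range(3)]
    print(size, [r[:3].tolist() for r in ranks])
```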
    Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines. (arXiv:2204.11131v2 [cs.LG] UPDATED)
    Developing modern machine learning (ML) applications is data-centric, and one fundamental challenge is to understand the influence of data quality on ML training -- "Which training examples are 'guilty' in making the trained ML model's predictions inaccurate or unfair?" Modeling data influence for ML training has attracted intensive interest over the last decade, and one popular framework is to compute the Shapley value of each training example with respect to utilities such as validation accuracy and fairness of the trained ML model. Unfortunately, despite recent intensive interest and research, existing methods only consider a single ML model "in isolation" and do not consider an end-to-end ML pipeline that consists of data transformations, feature extractors, and ML training. We present DataScope (ease.ml/datascope), the first system that efficiently computes Shapley values of training examples over an end-to-end ML pipeline, and illustrate its applications in data debugging for ML training. To this end, we first develop a novel algorithmic framework that computes Shapley values over a specific family of ML pipelines that we call canonical pipelines: a positive relational algebra query followed by a K-nearest-neighbor (KNN) classifier. We show that, for many subfamilies of canonical pipelines, computing the Shapley value is in PTIME, contrasting the exponential complexity of computing Shapley values in general. We then put this to practice -- given an sklearn pipeline, we approximate it with a canonical pipeline to use as a proxy. We conduct extensive experiments illustrating different use cases and utilities. Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods, while being comparably, and often even more, effective in data debugging.  ( 2 min )
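    The tractability here rests on the KNN endpoint: for an unweighted KNN utility, Shapley values admit an exact closed-form recursion (due to Jia et al. 2019, the building block that DataScope extends to whole pipelines). Below is a hedged sketch of that recursion for a single test point; DataScope's relational-algebra handling adds machinery not shown here.

```python
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, K=1):
    """Exact Shapley values of training points for a KNN accuracy utility,
    in O(n log n) per test point instead of exponential time."""
    n = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))  # nearest first
    match = (y_train[order] == y_test).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n
    # recursion from farthest to nearest point
    for i in range(n - 2, -1, -1):
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    shapley = np.zeros(n)
    shapley[order] = s  # map back to original training indices
    return shapley

X = np.random.rand(100, 3); y = (X[:, 0] > 0.5).astype(int)
vals = knn_shapley(X, y, x_test=np.array([0.9, 0.5, 0.5]), y_test=1, K=5)
print(np.argsort(-vals)[:5])  # most valuable training points for this test point
```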
    Reinforced Causal Explainer for Graph Neural Networks. (arXiv:2204.11028v2 [cs.LG] UPDATED)
    Explainability is crucial for probing graph neural networks (GNNs), answering questions like "Why does the GNN model make a certain prediction?". Feature attribution is a prevalent technique for highlighting the explanatory subgraph in the input graph that plausibly leads the GNN model to make its prediction. Various attribution methods exploit gradient-like or attention scores as the attributions of edges, then select the salient edges with top attribution scores as the explanation. However, most of these works make an untenable assumption - that the selected edges are linearly independent - thus leaving the dependencies among edges largely unexplored, especially their coalition effect. We demonstrate unambiguous drawbacks of this assumption - it makes the explanatory subgraph unfaithful and verbose. To address this challenge, we propose a reinforcement learning agent, the Reinforced Causal Explainer (RC-Explainer). It frames the explanation task as a sequential decision process - an explanatory subgraph is successively constructed by adding a salient edge to connect the previously selected subgraph. Technically, its policy network predicts the action of edge addition and receives a reward that quantifies the action's causal effect on the prediction. Such a reward accounts for the dependency between the newly added edge and the previously added edges, thus reflecting whether they collaborate and form a coalition to pursue better explanations. As such, RC-Explainer is able to generate faithful and concise explanations, and generalizes better to unseen graphs. When explaining different GNNs on three graph classification datasets, RC-Explainer achieves better or comparable performance to SOTA approaches w.r.t. predictive accuracy and contrastivity, and safely passes sanity checks and visual inspections. Codes are available at https://github.com/xiangwang1223/reinforced_causal_explainer.  ( 2 min )
    Long-term Spatio-temporal Forecasting via Dynamic Multiple-Graph Attention. (arXiv:2204.11008v2 [cs.LG] UPDATED)
    Many real-world ubiquitous applications, such as parking recommendations and air pollution monitoring, benefit significantly from accurate long-term spatio-temporal forecasting (LSTF). LSTF makes use of long-term dependency between spatial and temporal domains, contextual information, and inherent pattern in the data. Recent studies have revealed the potential of multi-graph neural networks (MGNNs) to improve prediction performance. However, existing MGNN methods cannot be directly applied to LSTF due to several issues: the low level of generality, insufficient use of contextual information, and the imbalanced graph fusion approach. To address these issues, we construct new graph models to represent the contextual information of each node and the long-term spatio-temporal data dependency structure. To fuse the information across multiple graphs, we propose a new dynamic multi-graph fusion module to characterize the correlations of nodes within a graph and the nodes across graphs via the spatial attention and graph attention mechanisms. Furthermore, we introduce a trainable weight tensor to indicate the importance of each node in different graphs. Extensive experiments on two large-scale datasets demonstrate that our proposed approaches significantly improve the performance of existing graph neural network models in LSTF prediction tasks.  ( 2 min )
    Sublinear Time Approximation of Text Similarity Matrices. (arXiv:2112.09631v3 [cs.LG] UPDATED)
    We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nystr\"{o}m method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in the downstream tasks of document classification, sentence similarity, and cross-document coreference.  ( 2 min )
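    A generic Nyström-style estimator of this kind is easy to sketch: sample landmark columns, form the landmark core block, and use its pseudo-inverse so that negative eigenvalues of an indefinite matrix are preserved. The code below is a hedged illustration with only O(ns) similarity calls; the paper's indefinite-aware estimator and its CUR variant differ in details.

```python
import numpy as np

def nystrom_indefinite(sim, n, s, rng=np.random.default_rng(0)):
    """Rank-<=s Nystrom-style approximation of an n x n similarity matrix
    using only O(n*s) calls to sim(i, j). A generic sketch, not the paper's
    exact estimator."""
    landmarks = rng.choice(n, size=s, replace=False)
    C = np.array([[sim(i, j) for j in landmarks] for i in range(n)])  # n x s
    W = C[landmarks]                                                  # s x s core
    # pseudo-inverse via eigendecomposition keeps negative eigenvalues,
    # which matters when the matrix is indefinite (non-PSD)
    vals, vecs = np.linalg.eigh((W + W.T) / 2)
    keep = np.abs(vals) > 1e-10
    W_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    return C @ W_pinv @ C.T  # approximates the full similarity matrix

# toy indefinite similarity: difference of two random Gram matrices
A = np.random.randn(200, 20); B = np.random.randn(200, 5)
S_full = A @ A.T - B @ B.T   # rank <= 25, not PSD
approx = nystrom_indefinite(lambda i, j: S_full[i, j], n=200, s=60)
print(np.linalg.norm(S_full - approx) / np.linalg.norm(S_full))  # near zero
```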
    AstBERT: Enabling Language Model for Financial Code Understanding with Abstract Syntax Trees. (arXiv:2201.07984v2 [cs.AI] UPDATED)
    Using the pre-trained language model (i.e. BERT) to apprehend source codes has attracted increasing attention from financial institutions owing to the great potential to uncover financial risks. However, there are several challenges in applying these language models to directly solve programming language (PL) related problems. To this end, we propose the AstBERT model, a pre-trained language model aiming to better understand the financial PL using the abstract syntax tree (AST). Specifically, we collect a colossal amount of source codes (both Java and Python) from the Alipay code repository and incorporate both syntactic and semantic code knowledge into our model through the help of code parsers, in which AST information of the source codes can be interpreted and integrated. We evaluate the performance of the proposed model on three tasks, including code question answering, code clone detection and code refinement. Experiment results show that our AstBERT achieves promising performance on three downstream tasks.  ( 2 min )
    ROMNet: Renovate the Old Memories. (arXiv:2202.02606v2 [eess.IV] UPDATED)
    Renovating the memories in old photos is an intriguing research topic in the computer vision field. These legacy images often suffer from severe and commingled degradations such as cracks, noise, and color-fading, while the lack of large-scale paired old-photo datasets makes this restoration task very challenging. In this work, we present a novel reference-based end-to-end learning framework that can jointly repair and colorize degraded legacy pictures. Specifically, the proposed framework consists of three modules: a restoration sub-network for degradation restoration, a similarity sub-network for color histogram matching and transfer, and a colorization sub-network that learns to predict the chroma elements of the images conditioned on chromatic reference signals. The whole system takes advantage of the color histogram priors in a given reference image, which vastly reduces the dependency on large-scale training data. Apart from the proposed method, we also create, to our knowledge, the first public real-world old photo dataset with paired ground truth for evaluating old photo restoration models, wherein each old photo is paired with a manually restored pristine image by Photoshop experts. Our extensive experiments conducted on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state of the art both quantitatively and qualitatively.  ( 2 min )
    LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval. (arXiv:2203.06169v2 [cs.CL] UPDATED)
    In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for training. Specifically, we first present Iterative Contrastive Learning (ICoL) that iteratively trains the query and document encoders with a cache mechanism. ICoL not only enlarges the number of negative instances but also keeps representations of cached examples in the same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a simple yet effective way to enhance dense retrieval with lexical matching. We evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18 datasets of 9 zero-shot text retrieval tasks. Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models, and further analysis reveals the effectiveness of our training strategy and objectives. Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.  ( 2 min )
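    The lexicon-enhanced scoring idea lends itself to a compact sketch: score documents with a dense dual encoder, score them lexically with BM25, and combine the two. We believe LaPraDoR combines the scores multiplicatively, but treat that combination, and all names below, as assumptions of this sketch; the embeddings here are random stand-ins for the trained encoders.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = ["a pretrained dense retriever needs no labels",
        "bm25 is a strong lexical matching baseline",
        "contrastive learning trains query and doc encoders"]
bm25 = BM25Okapi([d.split() for d in docs])

# Stand-ins for dual-tower document embeddings (in practice: trained encoders).
rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((len(docs), 128))
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)

def search(query, query_emb):
    dense = doc_emb @ (query_emb / np.linalg.norm(query_emb))  # cosine scores
    lexical = bm25.get_scores(query.split())
    # Lexicon-enhanced score: modulate dense similarity by lexical match
    # (multiplicative combination is an assumption of this sketch).
    return np.argsort(-(dense * lexical))

print(search("lexical matching baseline", rng.standard_normal(128)))
```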
    Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning. (arXiv:2112.11663v6 [cs.LG] UPDATED)
    Alternating gradient-descent-ascent (AltGDA) is an optimization algorithm that has been widely used for model training in various machine learning applications, aiming to solve a nonconvex minimax optimization problem. However, existing studies show that it suffers from a high computation complexity in nonconvex minimax optimization. In this paper, we develop a single-loop and fast AltGDA-type algorithm that leverages proximal gradient updates and momentum acceleration to solve regularized nonconvex minimax optimization problems. By leveraging the momentum acceleration technique, we prove that the algorithm converges to a critical point in nonconvex minimax optimization and achieves a computation complexity of order $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, where $\epsilon$ is the desired level of accuracy and $\kappa$ is the problem's condition number. Such a computation complexity improves upon the state-of-the-art complexities of single-loop GDA and AltGDA algorithms. We demonstrate the effectiveness of our algorithm via an experiment on adversarial deep learning.  ( 2 min )
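    To make the single-loop structure concrete, here is a toy sketch of an alternating proximal GDA step with momentum (extrapolation) on a small regularized saddle problem: a proximal descent step on the primal variable followed by an ascent step on the dual. Step sizes, the momentum rule, and the test problem are illustrative choices of this sketch, not the paper's exact scheme.

```python
import numpy as np

# Toy regularized minimax: min_x max_y  x.A.y + 0.5||x||^2 - 0.5||y||^2 + lam*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) * 0.1
lam, eta_x, eta_y, beta = 0.01, 0.1, 0.1, 0.5  # illustrative constants

def soft_threshold(v, t):
    # proximal operator of the L1 regularizer
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, y = rng.standard_normal(5), rng.standard_normal(5)
x_prev = x.copy()
for _ in range(2000):
    x_tilde = x + beta * (x - x_prev)          # momentum extrapolation
    x_prev = x.copy()
    grad_x = A @ y + x_tilde                   # gradient of smooth part in x
    x = soft_threshold(x_tilde - eta_x * grad_x, eta_x * lam)  # proximal descent
    grad_y = A.T @ x - y
    y = y + eta_y * grad_y                     # gradient ascent on y

print(np.linalg.norm(x), np.linalg.norm(y))   # both ~0 at the saddle point
```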
    Closing the Gap between Single-User and Multi-User VoiceFilter-Lite. (arXiv:2202.12169v2 [eess.AS] UPDATED)
    VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.  ( 2 min )
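    FiLM conditioning itself is a standard, well-documented mechanism: a conditioning vector predicts a per-feature scale and shift applied to intermediate features. The minimal PyTorch module below shows how an attended speaker embedding could modulate separator features; it is a generic sketch, not the production VoiceFilter-Lite model.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features with
    (gamma, beta) predicted from a conditioning vector -- here, the
    attended speaker embedding."""
    def __init__(self, embed_dim: int, feature_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(embed_dim, feature_dim)
        self.to_beta = nn.Linear(embed_dim, feature_dim)

    def forward(self, features: torch.Tensor, embedding: torch.Tensor):
        # features: (batch, time, feature_dim); embedding: (batch, embed_dim)
        gamma = self.to_gamma(embedding).unsqueeze(1)  # broadcast over time
        beta = self.to_beta(embedding).unsqueeze(1)
        return gamma * features + beta

film = FiLM(embed_dim=256, feature_dim=128)
out = film(torch.randn(4, 100, 128), torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 100, 128])
```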
    GNMR: A provable one-line algorithm for low rank matrix recovery. (arXiv:2106.12933v3 [math.OC] UPDATED)
    Low rank matrix recovery problems, including matrix completion and matrix sensing, appear in a broad range of applications. In this work we present GNMR -- an extremely simple iterative algorithm for low rank matrix recovery, based on a Gauss-Newton linearization. On the theoretical front, we derive recovery guarantees for GNMR in both the matrix sensing and matrix completion settings. Some of these results improve upon the best currently known for other methods. A key property of GNMR is that it implicitly keeps the factor matrices approximately balanced throughout its iterations. On the empirical front, we show that for matrix completion with uniform sampling, GNMR performs better than several popular methods, especially when given very few observations close to the information limit.  ( 2 min )
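    The "one-line" nature of the method comes from each iteration being a single linear least-squares solve in the factors after Gauss-Newton linearization of $UV^T$. The sketch below is a simplified reading of that idea for matrix completion (the minimal-norm lstsq solution loosely mirrors GNMR's minimal-norm choice); the paper's exact variants differ, and harder regimes may need a spectral initialization.

```python
import numpy as np

def gnmr_completion(X, mask, r, iters=25, seed=0):
    """Gauss-Newton matrix completion sketch: each iteration solves one
    linear least-squares problem for the new factors (U, V) obtained by
    linearizing U V^T around the current iterate."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
    obs = np.argwhere(mask)
    for _ in range(iters):
        D = np.zeros((len(obs), (n + m) * r))
        b = np.empty(len(obs))
        for k, (i, j) in enumerate(obs):
            D[k, i * r:(i + 1) * r] = V[j]                      # coeffs of new U[i]
            D[k, n * r + j * r:n * r + (j + 1) * r] = U[i]      # coeffs of new V[j]
            b[k] = X[i, j] + U[i] @ V[j]                        # linearization target
        z = np.linalg.lstsq(D, b, rcond=None)[0]                # (min-norm) LS solve
        U, V = z[:n * r].reshape(n, r), z[n * r:].reshape(m, r)
    return U @ V.T

# rank-2 ground truth, ~50% of entries observed
rng = np.random.default_rng(1)
X_true = rng.standard_normal((30, 2)) @ rng.standard_normal((30, 2)).T
mask = rng.random(X_true.shape) < 0.5
X_hat = gnmr_completion(X_true, mask, r=2)
print(np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true))  # small
```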
    Differentially Private SGDA for Minimax Problems. (arXiv:2201.09046v3 [cs.LG] UPDATED)
    Stochastic gradient descent ascent (SGDA) and its variants have been the workhorse for solving minimax problems. However, in contrast to the well-studied stochastic gradient descent (SGD) with differential privacy (DP) constraints, there is little work on understanding the generalization (utility) of SGDA with DP constraints. In this paper, we use the algorithmic stability approach to establish the generalization (utility) of DP-SGDA in different settings. In particular, for the convex-concave setting, we prove that DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual population risk in both smooth and non-smooth cases. To the best of our knowledge, this is the first such result for DP-SGDA in the non-smooth case. We further provide a utility analysis in the nonconvex-strongly-concave setting, giving the first known result in terms of the primal population risk. The convergence and generalization results for this nonconvex setting are new even in the non-private setting. Finally, numerical experiments are conducted to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex cases.  ( 2 min )
    ISNet: Costless and Implicit Image Segmentation for Deep Classifiers, with Application in COVID-19 Detection. (arXiv:2202.00232v3 [eess.IV] UPDATED)
    This work proposes a novel deep neural network (DNN) architecture, Implicit Segmentation Neural Network (ISNet), to solve the task of image segmentation followed by classification. It substitutes the common pipeline of two DNNs with a single model. We designed the ISNet for high flexibility and performance: it allows virtually any classification neural network architecture to analyze a common image as if it had been previously segmented. Furthermore, in relation to the unmodified classifier, the ISNet does not cause any increment in computational cost at run-time. We test the architecture with two applications: COVID-19 detection in chest X-rays, and facial attribute estimation. We implement an ISNet based on a DenseNet121 classifier, and compare the model to a U-net (performing lung/face segmentation) followed by a DenseNet121, and to a standalone DenseNet121. The new architecture matched the other DNNs in facial attribute estimation. Moreover, it strongly surpassed them in COVID-19 detection, according to an external test dataset. The ISNet precisely ignored the image regions outside of the lungs or faces. Therefore, in COVID-19 detection it reduced the effects of background bias and shortcut learning, and it improved security in facial attribute estimation. ISNet presents an accurate, fast, and light methodology. The successful implicit segmentation, considering two largely diverse fields, highlights the architecture's general applicability.  ( 3 min )
    Variational Learning for Unsupervised Knowledge Grounded Dialogs. (arXiv:2112.00653v3 [cs.CL] UPDATED)
    Recent methods for knowledge grounded dialogs generate responses by incorporating information from an external textual document. These methods do not require the exact document to be known during training and rely on the use of a retrieval system to fetch relevant documents from a large index. The documents used to generate the responses are modeled as latent variables whose prior probabilities need to be estimated. Models such as RAG and REALM, marginalize the document probabilities over the documents retrieved from the index to define the log likelihood loss function which is optimized end-to-end. In this paper, we develop a variational approach to the above technique wherein, we instead maximize the Evidence Lower bound (ELBO). Using a collection of three publicly available open-conversation datasets, we demonstrate how the posterior distribution, that has information from the ground-truth response, allows for a better approximation of the objective function during training. To overcome the challenges associated with sampling over a large knowledge collection, we develop an efficient approach to approximate the ELBO. To the best of our knowledge we are the first to apply variational training for open-scale unsupervised knowledge grounded dialog systems.  ( 2 min )
    Single-pass Object-adaptive Data Undersampling and Reconstruction for MRI. (arXiv:2111.09212v2 [eess.IV] UPDATED)
    There is much recent interest in techniques to accelerate the data acquisition process in MRI by acquiring limited measurements. Often sophisticated reconstruction algorithms are deployed to maintain high image quality in such settings. In this work, we propose a data-driven sampler using a convolutional neural network, MNet, to provide object-specific sampling patterns adaptive to each scanned object. The network observes very limited low-frequency k-space data for each object and rapidly predicts the desired undersampling pattern in one go that achieves high image reconstruction quality. We propose an accompanying alternating-type training framework with a mask-backward procedure that efficiently generates training labels for the sampler network and jointly trains an image reconstruction network. Experimental results on the fastMRI knee dataset demonstrate the ability of the proposed learned undersampling network to generate object-specific masks at fourfold and eightfold acceleration that achieve superior image reconstruction performance than several existing schemes. The source code for the proposed joint sampling and reconstruction learning framework is available at https://github.com/zhishenhuang/mri.  ( 2 min )
    Efficient Learning of the Parameters of Non-Linear Models using Differentiable Resampling in Particle Filters. (arXiv:2111.01409v2 [stat.ML] UPDATED)
    It has been widely documented that the sampling and resampling steps in particle filters cannot be differentiated. The reparameterisation trick was introduced to allow the sampling step to be reformulated as a differentiable function. We extend the reparameterisation trick to include the stochastic input to resampling, thereby limiting the discontinuities in the gradient calculation after this step. Knowing the gradients of the prior and likelihood allows us to run particle Markov Chain Monte Carlo (p-MCMC) and use the No-U-Turn Sampler (NUTS) as the proposal when estimating parameters. We compare the Metropolis-adjusted Langevin algorithm (MALA), Hamiltonian Monte Carlo with different numbers of steps, and NUTS. We consider two state-space models and show that NUTS improves the mixing of the Markov chain and can produce more accurate results in less computational time.  ( 2 min )
    Building separable approximations for quantum states via neural networks. (arXiv:2112.08055v4 [quant-ph] UPDATED)
    Finding the closest separable state to a given target state is a notoriously difficult task, even more difficult than deciding whether a state is entangled or separable. To tackle this task, we parametrize separable states with a neural network and train it to minimize the distance to a given target state, with respect to a differentiable distance, such as the trace distance or Hilbert--Schmidt distance. By examining the output of the algorithm, we obtain an upper bound on the entanglement of the target state, and construct an approximation for its closest separable state. We benchmark the method on a variety of well-known classes of bipartite states and find excellent agreement, even up to local dimension of $d=10$, while providing conjectures and analytic insight for isotropic and Werner states. Moreover, we show our method to be efficient in the multipartite case, considering different notions of separability. Examining three and four-party GHZ and W states we recover known bounds and obtain novel ones, for instance for triseparability.  ( 2 min )
    Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis. (arXiv:2111.15186v2 [cs.LG] UPDATED)
    Obtaining annotations for large training sets is expensive, especially in settings where domain knowledge is required, such as behavior analysis. Weak supervision has been studied to reduce annotation costs by using weak labels from task-specific labeling functions (LFs) to augment ground truth labels. However, domain experts still need to hand-craft different LFs for different tasks, limiting scalability. To reduce expert effort, we present AutoSWAP: a framework for automatically synthesizing data-efficient task-level LFs. The key to our approach is to efficiently represent expert knowledge in a reusable domain-specific language and more general domain-level LFs, with which we use state-of-the-art program synthesis techniques and a small labeled dataset to generate task-level LFs. Additionally, we propose a novel structural diversity cost that allows for efficient synthesis of diverse sets of LFs, further improving AutoSWAP's performance. We evaluate AutoSWAP in three behavior analysis domains and demonstrate that AutoSWAP outperforms existing approaches using only a fraction of the data. Our results suggest that AutoSWAP is an effective way to automatically generate LFs that can significantly reduce expert effort for behavior analysis.  ( 2 min )
    Regularized Newton Method with Global $O(1/k^2)$ Convergence. (arXiv:2112.02089v2 [math.OC] UPDATED)
    We present a Newton-type method that converges fast from any initialization and for arbitrary convex objectives with Lipschitz Hessians. We achieve this by merging the ideas of cubic regularization with a certain adaptive Levenberg--Marquardt penalty. In particular, we show that the iterates given by $x^{k+1}=x^k - \bigl(\nabla^2 f(x^k) + \sqrt{H\|\nabla f(x^k)\|} \mathbf{I}\bigr)^{-1}\nabla f(x^k)$, where $H>0$ is a constant, converge globally with a $\mathcal{O}(\frac{1}{k^2})$ rate. Our method is the first variant of Newton's method that has both cheap iterations and provably fast global convergence. Moreover, we prove that locally our method converges superlinearly when the objective is strongly convex. To boost the method's performance, we present a line search procedure that does not need hyperparameters and is provably efficient.  ( 2 min )
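    The update rule is given explicitly in the abstract, so it can be transcribed directly. Below is a small NumPy implementation of the iteration on an illustrative convex objective; the choice $H=1$ and the test problem are assumptions of this sketch ($H$ should be the Hessian-Lipschitz constant of the objective).

```python
import numpy as np

def regularized_newton(grad, hess, x0, H=1.0, iters=50):
    """Iterates x_{k+1} = x_k - (Hess f(x_k) + sqrt(H*||g_k||) I)^{-1} g_k,
    the update from the abstract (adaptive Levenberg-Marquardt penalty)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        lam = np.sqrt(H * np.linalg.norm(g))
        x = x - np.linalg.solve(hess(x) + lam * np.eye(len(x)), g)
    return x

# illustrative convex objective with Lipschitz Hessian:
# f(x) = sum_i log(1 + exp(a_i . x))
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 5))
s = lambda x: 1 / (1 + np.exp(-A @ x))                 # sigmoid of margins
grad = lambda x: A.T @ s(x)
hess = lambda x: (A.T * (s(x) * (1 - s(x)))) @ A       # A^T diag(s(1-s)) A

x_star = regularized_newton(grad, hess, np.ones(5))
print(np.linalg.norm(grad(x_star)))  # near zero at the minimizer
```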
    Physics-Driven Learning of Wasserstein GAN for Density Reconstruction in Dynamic Tomography. (arXiv:2110.15424v2 [eess.IV] UPDATED)
    Object density reconstruction from projections containing scattered radiation and noise is of critical importance in many applications. Existing scatter correction and density reconstruction methods may not provide the high accuracy needed in many applications and can break down in the presence of unmodeled or anomalous scatter and other experimental artifacts. Incorporating machine-learned models could prove beneficial for accurate density reconstruction particularly in dynamic imaging, where the time-evolution of the density fields could be captured by partial differential equations or by learning from hydrodynamics simulations. In this work, we demonstrate the ability of learned deep neural networks to perform artifact removal in noisy density reconstructions, where the noise is imperfectly characterized. We use a Wasserstein generative adversarial network (WGAN), where the generator serves as a denoiser that removes artifacts in densities obtained from traditional reconstruction algorithms. We train the networks from large density time-series datasets, with noise simulated according to parametric random distributions that may mimic noise in experiments. The WGAN is trained with noisy density frames as generator inputs, to match the generator outputs to the distribution of clean densities (time-series) from simulations. A supervised loss is also included in the training, which leads to improved density restoration performance. In addition, we employ physics-based constraints such as mass conservation during network training and application to further enable highly accurate density reconstructions. Our preliminary numerical results show that the models trained in our frameworks can remove significant portions of unknown noise in density time-series data.  ( 2 min )
    A Study of Fake News Reading and Annotating in Social Media Context. (arXiv:2109.12523v2 [cs.HC] UPDATED)
    The online spreading of fake news is a major issue threatening entire societies. Much of this spreading is enabled by new media formats, namely social networks and online media sites. Researchers and practitioners have been trying to address this by characterizing fake news and devising automated methods for detecting it. Detection methods have so far had only limited success, mostly due to the complexity of news content and context and the lack of properly annotated datasets. One possible way to boost the efficiency of automated misinformation detection methods is to imitate the detection work of humans. It is also important to understand the news consumption behavior of online users. In this paper, we present an eye-tracking study in which we let 44 lay participants casually read through a social media feed containing posts with news articles, some of which were fake. In a second run, we asked the participants to decide on the truthfulness of these articles. We also describe a follow-up qualitative study with a similar scenario, this time with 7 expert fake news annotators. We present the description of both studies, the characteristics of the resulting dataset (which we hereby publish), and several findings.  ( 2 min )
    Encoding Involutory Invariances in Neural Networks. (arXiv:2106.12891v2 [cs.LG] UPDATED)
    In certain situations, neural networks are trained on data that obey underlying symmetries. However, the predictions do not respect the symmetries exactly unless the symmetries are embedded in the network structure. In this work, we introduce architectures that embed a special kind of symmetry, namely invariance with respect to involutory linear/affine transformations up to parity $p=\pm 1$. We provide rigorous theorems to show that the proposed network ensures such an invariance and present qualitative arguments for a special universal approximation theorem. An adaptation of our techniques to CNN tasks for datasets with inherent horizontal/vertical reflection symmetry is demonstrated. Extensive experiments indicate that the proposed model outperforms baseline feed-forward and physics-informed neural networks while identically respecting the underlying symmetry.  ( 2 min )
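    One standard way to realize exact invariance up to parity is symmetrization: wrap a backbone $h$ as $f(x) = h(x) + p\,h(Sx)$, so that $f(Sx) = p\,f(x)$ whenever $S^2 = I$ and $p^2 = 1$. The PyTorch sketch below illustrates this construction; the paper's architectures embed the symmetry differently, so treat this as a generic sketch of the property, not their method.

```python
import torch
import torch.nn as nn

class InvolutoryInvariant(nn.Module):
    """Wraps a backbone h so that f(Sx) = p * f(x) for an involution S (S@S = I)
    and parity p in {+1, -1}. A standard symmetrization construction."""
    def __init__(self, backbone: nn.Module, S: torch.Tensor, parity: int = 1):
        super().__init__()
        assert torch.allclose(S @ S, torch.eye(S.shape[0]))  # involutory check
        self.h, self.S, self.p = backbone, S, parity

    def forward(self, x):
        # f(x) = h(x) + p*h(Sx)  =>  f(Sx) = h(Sx) + p*h(x) = p*f(x), since p^2 = 1
        return self.h(x) + self.p * self.h(x @ self.S.T)

S = torch.diag(torch.tensor([1.0, -1.0, 1.0]))  # a reflection, S@S = I
net = InvolutoryInvariant(nn.Sequential(nn.Linear(3, 16), nn.Tanh(),
                                        nn.Linear(16, 1)), S, parity=-1)
x = torch.randn(8, 3)
print(torch.allclose(net(x @ S.T), -net(x), atol=1e-6))  # True
```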
    Optimal Epidemic Control as a Contextual Combinatorial Bandit with Budget. (arXiv:2106.15808v2 [cs.LG] UPDATED)
    In light of the COVID-19 pandemic, it is an open challenge and critical practical problem to find an optimal way to dynamically prescribe the best policies that balance governmental resources and epidemic control in different countries and regions. To solve this multi-dimensional tradeoff of exploitation and exploration, we formulate this technical challenge as a contextual combinatorial bandit problem that jointly optimizes a multi-criteria reward function. Given the historical daily cases in a region and the past intervention plans in place, the agent should generate useful intervention plans that policy makers can implement in real time, minimizing both the number of daily COVID-19 cases and the stringency of the recommended interventions. We validate this concept with simulations of multiple realistic policy-making scenarios and demonstrate a clear advantage in providing a Pareto-optimal solution for the epidemic intervention problem.  ( 2 min )
    Leveraging power grid topology in machine learning assisted optimal power flow. (arXiv:2110.00306v3 [cs.LG] UPDATED)
    Machine learning assisted optimal power flow (OPF) aims to reduce the computational complexity of these non-linear and non-convex constrained optimization problems by consigning expensive (online) optimization to offline training. The majority of work in this area typically employs fully connected neural networks (FCNN). However, recently convolutional (CNN) and graph (GNN) neural networks have also been investigated, in an effort to exploit topological information within the power grid. Although promising results have been obtained, the literature lacks a systematic comparison of these architectures. Accordingly, we introduce a concise framework for generalizing methods for machine learning assisted OPF and assess the performance of a variety of FCNN, CNN and GNN models for two fundamental approaches in this domain: regression (predicting optimal generator set-points) and classification (predicting the active set of constraints). For several synthetic power grids with interconnected utilities, we show that locality properties between feature and target variables are scarce, and subsequently demonstrate marginal utility of applying CNN and GNN architectures compared to FCNN for a fixed grid topology. However, with variable topology (for instance, modeling transmission line contingency), GNN models are able to straightforwardly take the change of topological information into account and outperform both FCNN and CNN models.  ( 2 min )
    Maximum Entropy Dueling Network Architecture in Atari Domain. (arXiv:2107.14457v2 [cs.LG] UPDATED)
    In recent years, there have been many deep structures for Reinforcement Learning, mainly for value function estimation and representations. These methods achieved great success in the Atari 2600 domain. In this paper, we propose an improved architecture based upon Dueling Networks. In this architecture, there are two separate estimators: one approximates the state value function and the other the state advantage function. This improvement, based on Maximum Entropy, shows better policy evaluation compared to the original network and other value-based architectures in the Atari domain.  ( 2 min )
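    For reference, the standard dueling head aggregates the two streams as Q = V + A - mean(A). A natural maximum-entropy variant replaces this aggregator with a temperature-scaled log-sum-exp (soft maximum) of the advantages; the sketch below uses that substitution, which is an assumption of this sketch and not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SoftDuelingHead(nn.Module):
    """Dueling head with a soft-maximum aggregator: Q = V + A - tau*LSE(A/tau).
    As tau -> 0 this recovers the max-aggregated dueling identity."""
    def __init__(self, in_dim, n_actions, tau=1.0):
        super().__init__()
        self.value = nn.Linear(in_dim, 1)        # state value stream
        self.advantage = nn.Linear(in_dim, n_actions)  # advantage stream
        self.tau = tau

    def forward(self, features):
        v = self.value(features)                 # (batch, 1)
        a = self.advantage(features)             # (batch, n_actions)
        soft_max_a = self.tau * torch.logsumexp(a / self.tau, dim=1, keepdim=True)
        return v + a - soft_max_a

head = SoftDuelingHead(in_dim=512, n_actions=6)   # e.g. an Atari action set
print(head(torch.randn(32, 512)).shape)           # torch.Size([32, 6])
```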
    Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future. (arXiv:2106.04420v8 [cs.LG] UPDATED)
    In real-time forecasting in public health, data collection is a non-trivial and demanding task. Often, after being initially released, the data undergoes several revisions (possibly due to human or technical constraints); as a result, it may take weeks until it reaches a stable value. This so-called 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature. In this paper, we introduce the multi-variate backfill problem using COVID-19 as the motivating example. We construct a detailed dataset composed of relevant signals over the past year of the pandemic. We then systematically characterize several patterns in backfill dynamics and leverage our observations to formulate a novel problem and a neural framework, Back2Future, that aims to refine a given model's predictions in real time. Our extensive experiments demonstrate that our method refines the performance of top models for COVID-19 forecasting, yielding an 18% improvement over non-trivial baselines and enabling us to obtain a new SOTA performance. In addition, we show that our model improves model evaluation too; hence policy-makers can better understand the true accuracy of forecasting models in real time.  ( 3 min )
    SALIENCE: An Unsupervised User Adaptation Model for Multiple Wearable Sensors Based Human Activity Recognition. (arXiv:2108.10213v2 [eess.SP] UPDATED)
    Unsupervised user adaptation aligns the feature distributions of the data from training users and a new user, so that a well-trained wearable human activity recognition (WHAR) model can be adapted to the new user. With the development of wearable sensors, WHAR based on multiple wearable sensors is gaining more and more attention. To address the challenge that different sensors have different transferabilities, we propose the SALIENCE model (unsupervised user adaptation model for human activity recognition based on multiple wearable sensors). It aligns the data of each sensor separately to achieve local alignment, while uniformly aligning the data of all sensors to ensure global alignment. In addition, an attention mechanism is proposed to focus the activity classifier of SALIENCE on the sensors with strong feature discrimination and well-aligned distributions. Experiments are conducted on two public WHAR datasets, and the experimental results show that our model can yield a competitive performance.  ( 2 min )
    SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space. (arXiv:2008.00397v2 [cs.CV] UPDATED)
    In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computational and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. IP-based SeqDialN is our baseline with a simple 2-layer LSTM design that achieves decent performance. MR-based SeqDialN, on the other hand, recurrently refines the semantic question/history representations through the self-attention stack of the Transformer and produces promising results on the visual dialog task. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art among generative visual dialog models. We fine-tune discriminative SeqDialN with dense annotations and boost the performance up to 72.41% NDCG and 55.11% MRR. In this work, we discuss the extensive experiments we have conducted to demonstrate the effectiveness of our model components. We also provide visualization for the reasoning process from the relevant conversation rounds and discuss our fine-tuning methods. Our code is available at https://github.com/xiaoxiaoheimei/SeqDialN  ( 3 min )
    IH-GAN: A Conditional Generative Model for Implicit Surface-Based Inverse Design of Cellular Structures. (arXiv:2103.02588v4 [cs.CE] UPDATED)
    Variable-density cellular structures can overcome connectivity and manufacturability issues of topologically optimized structures, particularly those represented as discrete density maps. However, the optimization of such cellular structures is challenging due to the multiscale design problem. Past work addressing this problem generally either only optimizes the volume fraction of single-type unit cells but ignores the effects of unit cell geometry on properties, or considers the geometry-property relation but builds this relation via heuristics. In contrast, we propose a simple yet more principled way to accurately model the property to geometry mapping using a conditional deep generative model, named Inverse Homogenization Generative Adversarial Network (IH-GAN). It learns the conditional distribution of unit cell geometries given properties and can realize the one-to-many mapping from properties to geometries. We further reduce the complexity of IH-GAN by using the implicit function parameterization to represent unit cell geometries. Results show that our method can 1) generate various unit cells that satisfy given material properties with high accuracy ($R^2$-scores between target properties and properties of generated unit cells $>98\%$) and 2) improve the optimized structural performance over the conventional variable-density single-type structure. In the minimum compliance example, our IH-GAN generated structure achieves a $79.7\%$ reduction in concentrated stress and an extra $3.03\%$ reduction in displacement. In the target deformation examples, our IH-GAN generated structure reduces the target matching error by $86.4\%$ and $79.6\%$ for two test cases, respectively. We also demonstrated that the connectivity issue for multi-type unit cells can be solved by transition layer blending.  ( 3 min )
    Unify Local and Global Information for Top-$N$ Recommendation. (arXiv:2012.01635v2 [cs.IR] UPDATED)
    Knowledge graph (KG), integrating complex information and containing rich semantics, is widely considered as side information to enhance recommendation systems. However, most existing KG-based methods concentrate on encoding the structural information in the graph without utilizing the collaborative signals in user-item interaction data, which are important for understanding user preferences. Therefore, the representations learned by these models are insufficient for representing the semantic information of users and items in the recommendation environment. The combination of both kinds of data provides a good chance to solve this problem. To tackle this research gap, we propose a novel duet representation learning framework named KADM to fuse local information (user-item interaction data) and global information (external knowledge graph) for top-$N$ recommendation, which is composed of two separate sub-models. One learns local representations by discovering the inner correlations in local information with a knowledge-aware co-attention mechanism, and the other learns global representations by encoding the knowledge associations in global information with a relation-aware attention network. The two sub-models are jointly trained as part of the semantic fusion network to compute the user preferences, which discriminates the contributions of the two sub-models in a given context. We conduct experiments on two real-world datasets, and the evaluations show that KADM significantly outperforms state-of-the-art methods. Further ablation studies confirm that the duet architecture performs significantly better than either sub-model on the recommendation tasks.  ( 2 min )
    Variational Kalman Filtering with Hinf-Based Correction for Robust Bayesian Learning in High Dimensions. (arXiv:2204.13089v1 [stat.ML])
    In this paper, we address the problem of convergence of the sequential variational inference filter (VIF) through the application of a robust variational objective and an Hinf-norm based correction for a linear Gaussian system. As the dimension of the state or parameter space grows, performing the full Kalman update with the dense covariance matrix for a large-scale system requires increased storage and computational complexity, making it impractical. The VIF approach, based on mean-field Gaussian variational inference, reduces this burden through a variational approximation to the covariance, usually in the form of a diagonal covariance approximation. The challenge is to retain convergence and correct for biases introduced by the sequential VIF steps. We desire a framework that improves feasibility while still maintaining reasonable proximity to the optimal Kalman filter as data is assimilated. To accomplish this goal, an Hinf-norm based optimization perturbs the VIF covariance matrix to improve robustness. This yields a novel VIF-Hinf recursion that employs consecutive variational inference and Hinf based optimization steps. We explore the development of this method and investigate a numerical example to illustrate the effectiveness of the proposed filter.  ( 2 min )
    Exoskeleton-Based Multimodal Action and Movement Recognition: Identifying and Developing the Optimal Boosted Learning Approach. (arXiv:2106.10331v2 [cs.RO] UPDATED)
    This paper makes two scientific contributions to the field of exoskeleton-based action and movement recognition. First, it presents a novel machine learning and pattern recognition-based framework that can detect a wide range of actions and movements - walking, walking upstairs, walking downstairs, sitting, standing, lying, stand to sit, sit to stand, sit to lie, lie to sit, stand to lie, and lie to stand, with an overall accuracy of 82.63%. Second, it presents a comprehensive comparative study of different learning approaches - Random Forest, Artificial Neural Network, Decision Tree, Multiway Decision Tree, Support Vector Machine, k-NN, Gradient Boosted Trees, Decision Stump, AutoMLP, Linear Regression, Vector Linear Regression, Random Tree, Naïve Bayes, Naïve Bayes (Kernel), Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Deep Learning applied to this framework. The performance of each of these learning approaches was boosted by using the AdaBoost algorithm, and the cross-validation approach was used for training and testing. The results show that in boosted form, the k-NN classifier outperforms all the other boosted learning approaches and is, therefore, the optimal learning method for this purpose. The results presented and discussed uphold the importance of this work to contribute towards augmenting the abilities of exoskeleton-based assisted and independent living of the elderly in the future of Internet of Things-based living environments, such as Smart Homes. As a specific use case, we also discuss how the findings of our work are relevant for augmenting the capabilities of the Hybrid Assistive Limb exoskeleton, a highly functional lower limb exoskeleton.  ( 2 min )
    Residual Contrastive Learning for Image Reconstruction: Learning Transferable Representations from Noisy Images. (arXiv:2106.10070v2 [cs.CV] UPDATED)
    This paper is concerned with contrastive learning (CL) for low-level image restoration and enhancement tasks. We propose a new label-efficient learning paradigm based on residuals, residual contrastive learning (RCL), and derive an unsupervised visual representation learning framework, suitable for low-level vision tasks with noisy inputs. While supervised image reconstruction aims to minimize residual terms directly, RCL alternatively builds a connection between residuals and CL by defining a novel instance discrimination pretext task, using residuals as the discriminative feature. Our formulation mitigates the severe task misalignment between instance discrimination pretext tasks and downstream image reconstruction tasks, present in existing CL frameworks. Experimentally, we find that RCL can learn robust and transferable representations that improve the performance of various downstream tasks, such as denoising and super resolution, in comparison with recent self-supervised methods designed specifically for noisy inputs. Additionally, our unsupervised pre-training can significantly reduce annotation costs whilst maintaining performance competitive with fully-supervised image reconstruction.  ( 2 min )
    Neural String Edit Distance. (arXiv:2104.08388v2 [cs.CL] UPDATED)
    We propose the neural string edit distance model for string-pair matching and string transduction based on learnable string edit distance. We modify the original expectation-maximization learned edit distance algorithm into a differentiable loss function, allowing us to integrate it into a neural network providing a contextual representation of the input. We evaluate on cognate detection, transliteration, and grapheme-to-phoneme conversion, and show that we can trade off between performance and interpretability in a single framework. Using contextual representations, which are difficult to interpret, we match the performance of state-of-the-art string-pair matching models. Using static embeddings and a slightly different loss function, we force interpretability, at the expense of an accuracy drop.  ( 2 min )
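    A differentiable edit distance can be illustrated with a soft-min relaxation of the Levenshtein dynamic program, shown below in PyTorch. Note the swap: the paper derives its loss from the expectation-maximization formulation of learnable edit distance, whereas this sketch simply replaces the hard min with a smooth one so that gradients flow to learnable edit costs; the cost tensors here stand in for the neural scorer's outputs.

```python
import torch

def soft_min(vals, tau=0.1):
    # differentiable relaxation of min: -tau * logsumexp(-vals / tau)
    return -tau * torch.logsumexp(-vals / tau, dim=0)

def soft_edit_distance(cost_sub, cost_del, cost_ins, tau=0.1):
    """Differentiable Levenshtein DP. cost_sub: (n, m) substitution costs;
    cost_del: (n,) deletion costs; cost_ins: (m,) insertion costs.
    All can be autograd leaves or neural-network outputs."""
    n, m = cost_sub.shape
    # DP table as Python lists of 0-dim tensors so autograd tracks every cell
    D = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = torch.zeros(())
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost_del[i - 1]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost_ins[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = soft_min(torch.stack([
                D[i - 1][j - 1] + cost_sub[i - 1, j - 1],  # substitute/match
                D[i - 1][j] + cost_del[i - 1],             # delete
                D[i][j - 1] + cost_ins[j - 1],             # insert
            ]), tau)
    return D[n][m]

cost_sub = torch.rand(4, 5, requires_grad=True)
d = soft_edit_distance(cost_sub, torch.ones(4), torch.ones(5))
d.backward()                                   # gradients flow to the costs
print(d.item(), cost_sub.grad.shape)
```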
    Federated Reconstruction: Partially Local Federated Learning. (arXiv:2102.03448v6 [cs.LG] UPDATED)
    Personalization methods in federated learning aim to balance the benefits of federated and local training for data availability, communication cost, and robustness to client heterogeneity. Approaches that require clients to communicate all model parameters can be undesirable due to privacy and communication constraints. Other approaches require always-available or stateful clients, impractical in large-scale cross-device settings. We introduce Federated Reconstruction, the first model-agnostic framework for partially local federated learning suitable for training and inference at scale. We motivate the framework via a connection to model-agnostic meta learning, empirically demonstrate its performance over existing approaches for collaborative filtering and next word prediction, and release an open-source library for evaluating approaches in this setting. We also describe the successful deployment of this approach at scale for federated collaborative filtering in a mobile keyboard application.  ( 2 min )
    Rethinking the Promotion Brought by Contrastive Learning to Semi-Supervised Node Classification. (arXiv:2012.07437v2 [cs.LG] UPDATED)
    Graph Contrastive Learning (GCL) has proven highly effective in promoting the performance of Semi-Supervised Node Classification (SSNC). However, existing GCL methods are generally transferred from other fields like CV or NLP, and their underlying working mechanism remains under-explored. In this work, we first deeply probe the working mechanism of GCL in SSNC, and find that the promotion brought by GCL is severely unevenly distributed: the improvement mainly comes from subgraphs with less annotated information, which is fundamentally different from contrastive learning in other fields. However, existing GCL methods generally ignore this uneven distribution of annotated information and apply GCL evenly to the whole graph. To remedy this issue and further improve GCL in SSNC, we propose the Topology InFormation gain-Aware Graph Contrastive Learning (TIFA-GCL) framework that considers the annotated information distribution across the graph in GCL. Extensive experiments on six benchmark graph datasets, including the enormous OGB-Products graph, show that TIFA-GCL can bring a larger improvement than existing GCL methods in both transductive and inductive settings. Further experiments demonstrate the generalizability and interpretability of TIFA-GCL.  ( 2 min )
    Towards assessing agricultural land suitability with causal machine learning. (arXiv:2204.12956v1 [cs.LG])
    Understanding the suitability of agricultural land for applying specific management practices is of great importance for sustainable and resilient agriculture against climate change. Recent developments in the field of causal machine learning enable the estimation of intervention impacts on an outcome of interest, for samples described by a set of observed characteristics. We introduce an extensible data-driven framework that leverages earth observations and frames agricultural land suitability as a geospatial impact assessment problem, where the estimated effects of agricultural practices on agroecosystems serve as a land suitability score and guide decision making. We formulate this as a causal machine learning task and discuss how this approach can be used for agricultural planning in a changing climate. Specifically, we extract the agricultural management practices of "crop rotation" and "landscape crop diversity" from crop type maps, account for climate and land use data, and use double machine learning to estimate their heterogeneous effect on Net Primary Productivity (NPP), within the Flanders region of Belgium from 2010 to 2020. We find that the effect of crop rotation was insignificant, while landscape crop diversity had a small negative effect on NPP. Finally, we observe considerable effect heterogeneity in space for both practices and analyze it.  ( 2 min )
    Differentially Quantized Gradient Methods. (arXiv:2002.02508v4 [cs.LG] UPDATED)
    Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients while a server decides when to stop iterative computation based on its target accuracy or delay constraints. The server receives all its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call Differential Quantization (DQ) that prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$. Thus at any $R \geq \log_2(\rho_n/\sigma_{\mathrm{GD}})$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show that no algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. The principle of differential quantization continues to apply to gradient methods with momentum such as Nesterov's accelerated gradient descent and Polyak's heavy ball method. For these algorithms as well, if the rate is above a certain threshold, there is no loss in contraction factor obtained by the differentially quantized algorithm compared to its unquantized counterpart. Experimental results on least-squares problems validate our theoretical analysis.  ( 3 min )
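    To make the error-compensation idea concrete, here is a minimal NumPy sketch of differentially quantized gradient descent. The uniform scalar quantizer and all parameter values are our illustrative choices, not the paper's exact construction; the key step is carrying the residual u - q into the next round.

```python
import numpy as np

def quantize(v, num_levels=16, v_max=10.0):
    # Illustrative uniform scalar quantizer; the paper analyzes general
    # quantizers with covering efficiency rho_n.
    step = 2 * v_max / (num_levels - 1)
    return np.round(np.clip(v, -v_max, v_max) / step) * step

def dq_gd(grad_fn, x0, lr, steps):
    x, e = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad_fn(x)
        u = g + e              # compensate the accumulated quantization error
        q = quantize(u)        # the rate-limited message the worker transmits
        e = u - q              # carry the new residual into the next round
        x = x - lr * q         # the server descends on the quantized gradient
    return x

# Example: least squares, with gradient A^T (A x - b)
A, b = np.random.randn(50, 10), np.random.randn(50)
x_hat = dq_gd(lambda x: A.T @ (A @ x - b), np.zeros(10), lr=0.01, steps=500)
```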
    Network Classification Based Structural Analysis of Real Networks and their Model-Generated Counterparts. (arXiv:1810.08498v4 [cs.SI] UPDATED)
    Data-driven analysis of complex networks has been a focus of research for decades. An important area of research is to study how well real networks can be described with a small selection of metrics, and how well network models can capture the relations between graph metrics observed in real networks. In this paper, we apply machine learning techniques to investigate the aforementioned problems. We study 500 real-world networks along with 2,000 synthetic networks generated by four frequently used network models with previously calibrated parameters to make the generated graphs as similar to the real networks as possible. This paper unifies several branches of data-driven complex network analysis, such as the study of graph metrics and their pair-wise relationships, network similarity estimation, model calibration, and graph classification. We find that the correlation profiles of the structural measures significantly differ across network domains and that the domain can be efficiently determined using a small selection of graph metrics. The structural properties of the network models with fixed parameters are robust enough to perform parameter calibration. The goodness-of-fit of the network models highly depends on the network domain. By solving classification problems, we find that the models lack the capability of generating a graph with a high clustering coefficient and a relatively large diameter simultaneously. On the other hand, the models are able to capture the degree-distribution-related metrics exactly.  ( 2 min )
    Sequence-Based Target Coin Prediction for Cryptocurrency Pump-and-Dump. (arXiv:2204.12929v1 [q-fin.ST])
    As the pump-and-dump schemes (P&Ds) proliferate in the cryptocurrency market, it becomes imperative to detect such fraudulent activities in advance, to inform potentially susceptible investors before they become victims. In this paper, we focus on the target coin prediction task, i.e., to predict the pump probability of all coins listed in the target exchange before a pump. We conduct a comprehensive study of the latest P&Ds, investigate 709 events organized in Telegram channels from Jan. 2019 to Jan. 2022, and unearth some abnormal yet interesting patterns of P&Ds. Empirical analysis demonstrates that pumped coins exhibit intra-channel homogeneity and inter-channel heterogeneity, which inspires us to develop a novel sequence-based neural network named SNN. Specifically, SNN encodes each channel's pump history as a sequence representation via a positional attention mechanism, which filters useful information and alleviates the noise introduced when the sequence length is long. We also identify and address the coin-side cold-start problem in a practical setting. Extensive experiments show a lift of 1.6% AUC and 41.0% Hit Ratio@3 brought by our method, making it well-suited for real-world application. As a side contribution, we release the source code of our entire data science pipeline on GitHub, along with the dataset tailored for studying the latest P&Ds.  ( 2 min )
    Accurate inference of crowdsourcing properties when using efficient allocation strategies. (arXiv:1903.03104v2 [cs.LG] UPDATED)
    Allocation strategies improve the efficiency of crowdsourcing by decreasing the work needed to complete individual tasks accurately. However, these algorithms introduce bias by preferentially allocating workers onto easy tasks, leading to sets of completed tasks that are no longer representative of all tasks. This bias challenges inference of problem-wide properties such as typical task difficulty or crowd properties such as worker completion times, important information that goes beyond the crowd responses themselves. Here we study inference about problem properties when using an allocation algorithm to improve crowd efficiency. We introduce Decision-Explicit Probability Sampling (DEPS), a novel method to perform inference of problem properties while accounting for the potential bias introduced by an allocation strategy. Experiments on real and synthetic crowdsourcing data show that DEPS outperforms baseline inference methods while still leveraging the efficiency gains of the allocation method. The ability to perform accurate inference of general properties when using non-representative data allows crowdsourcers to extract more knowledge out of a given crowdsourced dataset.  ( 2 min )
    An Iterative Labeling Method for Annotating Fisheries Imagery. (arXiv:2204.12934v1 [cs.LG])
    In this paper, we present a methodology for fisheries-related data that allows us to converge on a labeled image dataset by iterating over the dataset with multiple training and production loops that can exploit crowdsourcing interfaces. We present our algorithm and its results on two separate sets of image data collected using the Seabed autonomous underwater vehicle. The first dataset comprises 2,026 completely unlabeled images, while the second consists of 21,968 images that were point annotated by experts. Our results indicate that training with a small subset and iterating on that to build a larger set of labeled data allows us to converge to a fully annotated dataset with a small number of iterations. Even in the case of a dataset labeled by experts, a single iteration of the methodology improves the labels by discovering additional complicated examples of labels associated with fish that overlap, are very small, or are obscured by the contrast limitations of underwater imagery.  ( 2 min )
    Treating Crowdsourcing as Examination: How to Score Tasks and Online Workers?. (arXiv:2204.13065v1 [cs.HC])
    Crowdsourcing is an online outsourcing mode that can address the urgent need of current machine learning algorithms for massive labeled data. Requesters post tasks on crowdsourcing platforms, which employ online workers over the Internet to complete the tasks, then aggregate and return the results to the requester. How to model the interaction between different types of workers and tasks is an active research question. In this paper, we model workers as four types based on their ability: expert, normal worker, sloppy worker and spammer, and divide tasks into hard, medium and easy according to their difficulty. We believe that even experts struggle with difficult tasks while sloppy workers can get easy tasks right, and that spammers deliberately give wrong answers. Good examination tasks should therefore have a moderate degree of difficulty and discriminability to score workers more objectively. Thus, we first score workers' ability mainly on the medium-difficulty tasks, then reduce the weight of answers from sloppy workers and modify the answers from spammers when inferring the tasks' ground truth. A probabilistic graphical model is adopted to simulate the task execution process, and an iterative method is adopted to calculate and update the ground truth, the ability of workers and the difficulty of the tasks in turn. We verify the correctness and effectiveness of our algorithm in both simulated and real crowdsourcing scenarios.  ( 2 min )
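    The alternating update loop can be sketched in a few lines. The snippet below is a simplified weighted-vote variant, not the paper's probabilistic graphical model: task truths and worker abilities are re-estimated in turn, mirroring the iterative calculation the abstract describes (binary labels and the agreement-based ability score are our simplifying assumptions).

```python
import numpy as np

def iterative_truth_inference(answers, n_iters=20):
    """answers: (n_workers, n_tasks) matrix of labels in {0, 1}."""
    n_workers, n_tasks = answers.shape
    ability = np.full(n_workers, 0.5)                 # initial worker scores
    for _ in range(n_iters):
        votes = ability @ answers                     # ability-weighted votes for label 1
        truth = (votes > ability.sum() / 2).astype(int)
        ability = (answers == truth).mean(axis=1)     # agreement with inferred truth
    return truth, ability
```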
    NFT Appraisal Prediction: Utilizing Search Trends, Public Market Data, Linear Regression and Recurrent Neural Networks. (arXiv:2204.12932v1 [q-fin.ST])
    In this paper we investigate the correlation between NFT valuations and various features from three primary categories: public market data, NFT metadata, and social trends data.  ( 2 min )
    Bisimulation Makes Analogies in Goal-Conditioned Reinforcement Learning. (arXiv:2204.13060v1 [cs.LG])
    Building generalizable goal-conditioned agents from rich observations is a key to reinforcement learning (RL) solving real world problems. Traditionally in goal-conditioned RL, an agent is provided with the exact goal they intend to reach. However, it is often not realistic to know the configuration of the goal before performing a task. A more scalable framework would allow us to provide the agent with an example of an analogous task, and have the agent then infer what the goal should be for its current state. We propose a new form of state abstraction called goal-conditioned bisimulation that captures functional equivariance, allowing for the reuse of skills to achieve new goals. We learn this representation using a metric form of this abstraction, and show its ability to generalize to new goals in simulation manipulation tasks. Further, we prove that this learned representation is sufficient not only for goal conditioned tasks, but is amenable to any downstream task described by a state-only reward function. Videos can be found at https://sites.google.com/view/gc-bisimulation.  ( 2 min )
    Faster online calibration without randomization: interval forecasts and the power of two choices. (arXiv:2204.13087v1 [cs.LG])
    We study the problem of making calibrated probabilistic forecasts for a binary sequence generated by an adversarial nature. Following the seminal paper of Foster and Vohra (1998), nature is often modeled as an adaptive adversary who sees all activity of the forecaster except the randomization that the forecaster may deploy. A number of papers have proposed randomized forecasting strategies that achieve an $\epsilon$-calibration error rate of $O(1/\sqrt{T})$, which we prove is tight in general. On the other hand, it is well known that it is not possible to be calibrated without randomization, or if nature also sees the forecaster's randomization; in both cases the calibration error could be $\Omega(1)$. Inspired by the equally seminal works on the "power of two choices" and imprecise probability theory, we study a small variant of the standard online calibration problem. The adversary gives the forecaster the option of making two nearby probabilistic forecasts, or equivalently an interval forecast of small width, and the endpoint closest to the revealed outcome is used to judge calibration. This power of two choices, or imprecise forecast, accords the forecaster with significant power -- we show that a faster $\epsilon$-calibration rate of $O(1/T)$ can be achieved even without deploying any randomization.  ( 2 min )
    Can deep learning match the efficiency of human visual long-term memory to store object details?. (arXiv:2204.13061v1 [cs.LG])
    Humans have a remarkably large capacity to store detailed visual information in long-term memory even after a single exposure, as demonstrated by classic experiments in psychology. For example, Standing (1973) showed that humans could recognize with high accuracy thousands of pictures that they had seen only once a few days prior to a recognition test. In deep learning, the primary mode of incorporating new information into a model is through gradient descent in the model's parameter space. This paper asks whether deep learning via gradient descent can match the efficiency of human visual long-term memory to incorporate new information in a rigorous, head-to-head, quantitative comparison. We answer this in the negative: even in the best case, models learning via gradient descent appear to require approximately 10 exposures to the same visual materials in order to reach a recognition memory performance humans achieve after only a single exposure. Prior knowledge induced via pretraining and bigger model sizes improve performance, but these improvements only become apparent after a few exposures, suggesting that simply scaling up the pretraining data size or model size might not be enough for the model to reach human-level memory efficiency.  ( 2 min )
    NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue. (arXiv:2204.13021v1 [cs.CL])
    We present NLU++, a novel dataset for natural language understanding (NLU) in task-oriented dialogue (ToD) systems, with the aim to provide a much more challenging evaluation environment for dialogue NLU models, up to date with the current application and industry requirements. NLU++ is divided into two domains (BANKING and HOTELS) and brings several crucial improvements over current commonly used NLU datasets. 1) NLU++ provides fine-grained domain ontologies with a large set of challenging multi-intent sentences, introducing and validating the idea of intent modules that can be combined into complex intents that convey complex user goals, combined with finer-grained and thus more challenging slot sets. 2) The ontology is divided into domain-specific and generic (i.e., domain-universal) intent modules that overlap across domains, promoting cross-domain reusability of annotated examples. 3) The dataset design has been inspired by the problems observed in industrial ToD systems, and 4) it has been collected, filtered and carefully annotated by dialogue NLU experts, yielding high-quality annotated data. Finally, we benchmark a series of current state-of-the-art NLU models on NLU++; the results demonstrate the challenging nature of the dataset, especially in low-data regimes, the validity of 'intent modularisation', and call for further research on ToD NLU.  ( 2 min )
    Dropout Inference with Non-Uniform Weight Scaling. (arXiv:2204.13047v1 [cs.LG])
    Dropout as regularization has been used extensively to prevent overfitting for training neural networks. During training, units and their connections are randomly dropped, which could be considered as sampling many different submodels from the original model. At test time, weight scaling and Monte Carlo approximation are two widely applied approaches to approximate the outputs. Both approaches work well practically when all submodels are low-bias complex learners. However, in this work, we demonstrate scenarios where some submodels behave closer to high-bias models and a non-uniform weight scaling is a better approximation for inference.  ( 2 min )
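    For reference, the two standard inference approximations the abstract contrasts look like this in PyTorch. The paper's contribution, non-uniform scaling, would replace the single uniform 1 - p factor with per-unit scales; we only sketch the two baselines here.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))
x = torch.randn(8, 32)

# Weight scaling: eval() turns dropout off; since PyTorch uses inverted
# dropout (dividing by 1 - p at train time), this is equivalent to the
# classic rule of scaling weights by 1 - p at test time.
model.eval()
with torch.no_grad():
    y_scaled = model(x)

# Monte Carlo approximation: keep dropout active and average many sampled
# submodels.
model.train()
with torch.no_grad():
    y_mc = torch.stack([model(x) for _ in range(100)]).mean(dim=0)
```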
    Binding Actions to Objects in World Models. (arXiv:2204.13022v1 [cs.LG])
    We study the problem of binding actions to objects in object-factored world models using action-attention mechanisms. We propose two attention mechanisms for binding actions to objects, soft attention and hard attention, which we evaluate in the context of structured world models for five environments. Our experiments show that hard attention helps contrastively-trained structured world models to learn to separate individual objects in an object-based grid-world environment. Further, we show that soft attention increases performance of factored world models trained on a robotic manipulation task. The learned action attention weights can be used to interpret the factored world model as the attention focuses on the manipulated object in the environment.  ( 2 min )
    Unsupervised Learning of Unbiased Visual Representations. (arXiv:2204.12941v1 [cs.LG])
    Deep neural networks are known for their inability to learn robust representations when biases exist in the dataset. This results in a poor generalization to unbiased datasets, as the predictions strongly rely on peripheral and confounding factors, which are erroneously learned by the network. Many existing works deal with this issue by either employing an explicit supervision on the bias attributes, or assuming prior knowledge about the bias. In this work we study this problem in a more difficult scenario, in which no explicit annotation about the bias is available, and without any prior knowledge about its nature. We propose a fully unsupervised debiasing framework, consisting of three steps: first, we exploit the natural preference for learning malignant biases, obtaining a bias-capturing model; then, we perform a pseudo-labelling step to obtain bias labels; finally we employ state-of-the-art supervised debiasing techniques to obtain an unbiased model. We also propose a theoretical framework to assess the biasness of a model, and provide a detailed analysis on how biases affect the training of neural networks. We perform experiments on synthetic and real-world datasets, showing that our method achieves state-of-the-art performance in a variety of settings, sometimes even higher than fully supervised debiasing approaches.  ( 2 min )
    Learning to Transfer Role Assignment Across Team Sizes. (arXiv:2204.12937v1 [cs.LG])
    Multi-agent reinforcement learning holds the key for solving complex tasks that demand the coordination of learning agents. However, strong coordination often leads to expensive exploration over the exponentially large state-action space. A powerful approach is to decompose team works into roles, which are ideally assigned to agents with the relevant skills. Training agents to adaptively choose and play emerging roles in a team thus allows the team to scale to complex tasks and quickly adapt to changing environments. These promises, however, have not been fully realised by current role-based multi-agent reinforcement learning methods as they assume either a pre-defined role structure or a fixed team size. We propose a framework to learn role assignment and transfer across team sizes. In particular, we train a role assignment network for small teams by demonstration and transfer the network to larger teams, which continue to learn through interaction with the environment. We demonstrate that re-using the role-based credit assignment structure can foster the learning process of larger reinforcement learning teams to achieve tasks requiring different roles. Our proposal outperforms competing techniques in enriched role-enforcing Prey-Predator games and in new scenarios in the StarCraft II Micro-Management benchmark.  ( 2 min )
    Multi-Objective Physics-Guided Recurrent Neural Networks for Identifying Non-Autonomous Dynamical Systems. (arXiv:2204.12972v1 [eess.SY])
    While trade-offs between modeling effort and model accuracy remain a major concern with system identification, resorting to data-driven methods often leads to a complete disregard for physical plausibility. To address this issue, we propose a physics-guided hybrid approach for modeling non-autonomous systems under control. Starting from a traditional physics-based model, this is extended by a recurrent neural network and trained using a sophisticated multi-objective strategy yielding physically plausible models. While purely data-driven methods fail to produce satisfying results, experiments conducted on real data reveal substantial accuracy improvements by our approach compared to a physics-based model.  ( 2 min )
    Domain Knowledge-Infused Deep Learning for Automated Analog/Radio-Frequency Circuit Parameter Optimization. (arXiv:2204.12948v1 [cs.LG])
    The design automation of analog circuits is a longstanding challenge. This paper presents a reinforcement learning method enhanced by graph learning to automate the analog circuit parameter optimization at the pre-layout stage, i.e., finding device parameters to fulfill desired circuit specifications. Unlike all prior methods, our approach is inspired by human experts who rely on domain knowledge of analog circuit design (e.g., circuit topology and couplings between circuit specifications) to tackle the problem. By originally incorporating such key domain knowledge into policy training with a multimodal network, the method best learns the complex relations between circuit parameters and design targets, enabling optimal decisions in the optimization process. Experimental results on exemplary circuits show that it achieves human-level design accuracy (99%) with 1.5X the efficiency of the existing best-performing methods. Our method also shows better generalization ability to unseen specifications and optimality in circuit performance optimization. Moreover, it can be applied to designing radio-frequency circuits on emerging semiconductor technologies, breaking the limitations of prior learning methods in designing conventional analog circuits.  ( 2 min )
    Meshless method stencil evaluation with machine learning. (arXiv:2204.12940v1 [cs.LG])
    Meshless methods are an active and modern branch of numerical analysis with many intriguing benefits. One of the main open research questions related to local meshless methods is how to select the best possible stencil - a collection of neighbouring nodes - to base the calculation on. In this paper, we describe the procedure for generating a labelled stencil dataset and use a variation of PointNet - a deep learning network operating on point clouds - to create a classifier for the quality of the stencil. We exploit features of PointNet to implement a model that can be used to classify differently sized stencils and compare it against models dedicated to a single stencil size. The model is particularly good at detecting the best and the worst stencils, with a respectable area under the curve (AUC) metric of around 0.90. There is much potential for further improvement and direct application in the meshless domain.  ( 2 min )
    GypSum: Learning Hybrid Representations for Code Summarization. (arXiv:2204.12916v1 [cs.SE])
    Code summarization with deep learning has been widely studied in recent years. Current deep learning models for code summarization generally follow the principle in neural machine translation and adopt the encoder-decoder framework, where the encoder learns the semantic representations from source code and the decoder transforms the learnt representations into human-readable text that describes the functionality of code snippets. Although they achieve new state-of-the-art performance, we notice that current models often either generate less fluent summaries or fail to capture the core functionality, since they usually focus on a single type of code representation. We therefore propose GypSum, a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model. We introduce particular edges related to the control flow of a code snippet into the abstract syntax tree for graph construction, and design two encoders to learn from the graph and the token sequence of source code, respectively. We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation. Experimental results demonstrate the superior performance of GypSum over existing code summarization models.  ( 2 min )
    A Bayesian Approach To Graph Partitioning. (arXiv:2204.12927v1 [cs.LG])
    We present a new algorithm, based on Bayesian inference with a Gaussian Process (GP), for learning local graph conductance. It uses advanced MCMC convergence ideas to obtain a scalable and fast method for converging to the stationary distribution, which is then used to learn the behavior of conductance when traversing an undirected weighted graph. First, metric embedding is used to represent the vertices of the graph. Then, uniform induced conductance is calculated for the training points. Finally, in the learning step, a Gaussian process is used to approximate the uniform induced conductance. MCMC is used to measure the uncertainty of the estimated hyper-parameters.  ( 2 min )
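    As an illustration of the learning step, a GP regressor mapping vertex embeddings to induced conductance might look as follows. This is a scikit-learn sketch with random placeholder data standing in for the metric embeddings and conductance values the pipeline would actually produce.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.random((200, 8))    # placeholder: metric embeddings of vertices
y_train = rng.random(200)         # placeholder: uniform induced conductance

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)
mean, std = gp.predict(rng.random((10, 8)), return_std=True)  # predictive uncertainty
```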
    Using the Projected Belief Network at High Dimensions. (arXiv:2204.12922v1 [cs.LG])
    The projected belief network (PBN) is a layered generative network (LGN) with a tractable likelihood function, and is based on a feed-forward neural network (FFNN). There are two versions of the PBN: stochastic and deterministic (D-PBN), and each has theoretical advantages over other LGNs. However, implementation of the PBN requires an iterative algorithm that includes the inversion of a symmetric matrix of size M x M in each layer, where M is the layer output dimension. This, and the fact that the network must always be dimension-reducing in each layer, can limit the types of problems where the PBN can be applied. In this paper, we describe techniques to avoid or mitigate these restrictions and use the PBN effectively at high dimension. We apply the discriminatively aligned PBN (PBN-DA) to classifying and auto-encoding high-dimensional spectrograms of acoustic events. We also present the discriminatively aligned D-PBN for the first time.  ( 2 min )
    Discovering Quantum Phase Transitions with Fermionic Neural Networks. (arXiv:2202.05183v2 [physics.comp-ph] UPDATED)
    Deep neural networks have been extremely successful as highly accurate wave function ansätze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network is given no a priori knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.  ( 2 min )
    Variance-Reduced Heterogeneous Federated Learning via Stratified Client Selection. (arXiv:2201.05762v2 [cs.LG] UPDATED)
    Client selection strategies are widely adopted to handle the communication-efficiency problem in recent studies of Federated Learning (FL). However, due to the large variance of the selected subset's update, prior selection approaches with a limited sampling ratio cannot perform well on convergence and accuracy in heterogeneous FL. To address this problem, in this paper, we propose a novel stratified client selection scheme to reduce the variance for the pursuit of better convergence and higher accuracy. Specifically, to mitigate the impact of heterogeneity, we develop stratification based on clients' local data distribution to derive approximate homogeneous strata for better selection in each stratum. Concentrating on a limited sampling ratio scenario, we next present an optimized sample size allocation scheme by considering the diversity of each stratum's variability, with the promise of further variance reduction. Theoretically, we elaborate the explicit relation among different selection schemes with regard to variance and, under heterogeneous settings, demonstrate the effectiveness of our selection scheme. Experimental results confirm that our approach not only allows for better performance relative to state-of-the-art methods but also is compatible with prevalent FL algorithms.  ( 2 min )
    Explainable k-means. Don't be greedy, plant bigger trees!. (arXiv:2111.03193v2 [cs.LG] UPDATED)
    We provide a new bi-criteria $\tilde{O}(\log^2 k)$ competitive algorithm for explainable $k$-means clustering. Explainable $k$-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable $k$-means clustering equals the sum of the costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non-bi-criteria algorithm for explainable clustering is $\tilde{O}(k)$ competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into $(1+\delta)k$ clusters (where $\delta\in (0,1)$ is a parameter of the algorithm). The cost of this clustering is at most $\tilde{O}(1/ \delta \cdot \log^2 k)$ times the cost of the optimal unconstrained $k$-means clustering. We show that this bound is almost optimal.  ( 2 min )
    Certified Robustness via Randomized Smoothing over Multiplicative Parameters. (arXiv:2106.14432v2 [cs.LG] UPDATED)
    Currently the most popular method of providing robustness certificates is randomized smoothing, where an input is smoothed via some probability distribution. We propose a novel approach to randomized smoothing over multiplicative parameters. Using this method we construct certifiably robust classifiers with respect to a gamma correction perturbation and compare the result with classifiers obtained via other smoothing distributions (Gaussian, Laplace, uniform). The experiments show that the asymmetrical Rayleigh distribution allows us to obtain better certificates for some values of perturbation parameters. To the best of our knowledge, this is the first work concerning certified robustness against the multiplicative gamma correction transformation and the first to study the effects of asymmetrical distributions in randomized smoothing.  ( 2 min )
    Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization. (arXiv:2106.11180v2 [math.OC] UPDATED)
    Established approaches to obtain generalization bounds in data-driven optimization and machine learning mostly build on solutions from empirical risk minimization (ERM), which depend crucially on the functional complexity of the hypothesis class. In this paper, we present an alternate route to obtain these bounds on the solution from distributionally robust optimization (DRO), a recent data-driven optimization framework based on worst-case analysis and the notion of ambiguity set to capture statistical uncertainty. In contrast to the hypothesis class complexity in ERM, our DRO bounds depend on the ambiguity set geometry and its compatibility with the true loss function. Notably, when using maximum mean discrepancy as a DRO distance metric, our analysis implies generalization bounds that depend solely on the true loss function. To the best of our knowledge, it is the first generalization bound in the literature that is entirely independent of any other candidates in the hypothesis class. We hope our findings can open the door for a better understanding of DRO, especially its benefits on loss minimization and other machine learning applications.  ( 2 min )
    BINAS: Bilinear Interpretable Neural Architecture Search. (arXiv:2110.12399v3 [cs.LG] UPDATED)
    Practical use of neural networks often involves requirements on latency, energy and memory among others. A popular approach to find networks under such requirements is through constrained Neural Architecture Search (NAS). However, previous methods use complicated predictors for the accuracy of the network. Those predictors are hard to interpret and sensitive to many hyperparameters that must be tuned; hence, the resulting accuracy of the generated models is often harmed. In this work we resolve this by introducing Bilinear Interpretable Neural Architecture Search (BINAS), which is based on an accurate and simple bilinear formulation of both an accuracy estimator and the expected resource requirement, together with a scalable search method with theoretical guarantees. The simplicity of our proposed estimator together with the intuitive way it is constructed bring interpretability through many insights about the contribution of different design choices. For example, we find that in the examined search space, adding depth and width is more effective at deeper stages of the network and at the beginning of each resolution stage. Our experiments show that BINAS generates comparable to or better architectures than other state-of-the-art NAS methods within a reduced marginal search cost, while strictly satisfying the resource constraints.  ( 2 min )
    Online Deep Learning from Doubly-Streaming Data. (arXiv:2204.11793v2 [cs.LG] UPDATED)
    This paper investigates a new online learning problem with doubly-streaming data, where the data streams are described by feature spaces that constantly evolve, with new features emerging and old features fading away. The challenges of this problem are twofold: 1) Data samples ceaselessly flowing in may carry shifted patterns over time, requiring learners to update and hence adapt on the fly. 2) Newly emerging features are described by very few samples, resulting in weak learners that tend to make error predictions. A plausible idea to overcome the challenges is to establish a relationship between the pre- and post-evolving feature spaces, so that an online learner can leverage the knowledge learned from the old features to improve the learning performance on the new features. Unfortunately, this idea does not scale up to high-dimensional media streams with complex feature interplay, which suffers a tradeoff between onlineness (favoring shallow learners) and expressiveness (requiring deep learners). Motivated by this, we propose a novel OLD^3S paradigm, where a shared latent subspace is discovered to summarize information from the old and new feature spaces, building an intermediate feature mapping relationship. A key trait of OLD^3S is to treat the model capacity as learnable semantics, yielding optimal model depth and parameters jointly, in accordance with the complexity and non-linearity of the input data streams in an online fashion. Both theoretical analyses and empirical studies substantiate the viability and effectiveness of our proposal.  ( 2 min )
    TERMinator: A Neural Framework for Structure-Based Protein Design using Tertiary Repeating Motifs. (arXiv:2204.13048v1 [q-bio.BM])
    Computational protein design has the potential to deliver novel molecular structures, binders, and catalysts for myriad applications. Recent neural graph-based models that use backbone coordinate-derived features show exceptional performance on native sequence recovery tasks and are promising frameworks for design. A statistical framework for modeling protein sequence landscapes using Tertiary Motifs (TERMs), compact units of recurring structure in proteins, has also demonstrated good performance on protein design tasks. In this work, we investigate the use of TERM-derived data as features in neural protein design frameworks. Our graph-based architecture, TERMinator, incorporates TERM-based and coordinate-based information and outputs a Potts model over sequence space. TERMinator outperforms state-of-the-art models on native sequence recovery tasks, suggesting that utilizing TERM-based and coordinate-based features together is beneficial for protein design.  ( 2 min )
    An Empirical Evaluation of Flow Based Programming in the Machine Learning Deployment Context. (arXiv:2204.12781v1 [cs.SE])
    As use of data driven technologies spreads, software engineers are more often faced with the task of solving a business problem using data-driven methods such as machine learning (ML) algorithms. Deployment of ML within large software systems brings new challenges that are not addressed by standard engineering practices and as a result businesses observe high rate of ML deployment project failures. Data Oriented Architecture (DOA) is an emerging approach that can support data scientists and software developers when addressing such challenges. However, there is a lack of clarity about how DOA systems should be implemented in practice. This paper proposes to consider Flow-Based Programming (FBP) as a paradigm for creating DOA applications. We empirically evaluate FBP in the context of ML deployment on four applications that represent typical data science projects. We use Service Oriented Architecture (SOA) as a baseline for comparison. Evaluation is done with respect to different application domains, ML deployment stages, and code quality metrics. Results reveal that FBP is a suitable paradigm for data collection and data science tasks, and is able to simplify data collection and discovery when compared with SOA. We discuss the advantages of FBP as well as the gaps that need to be addressed to increase FBP adoption as a standard design paradigm for DOA.  ( 2 min )
    MAPLE-Edge: A Runtime Latency Predictor for Edge Devices. (arXiv:2204.12950v1 [cs.LG])
    Neural Architecture Search (NAS) has enabled automatic discovery of more efficient neural network architectures, especially for mobile and embedded vision applications. Although recent research has proposed ways of quickly estimating latency on unseen hardware devices with just a few samples, little focus has been given to the challenges of estimating latency on runtimes using optimized graphs, such as TensorRT, and specifically for edge devices. In this work, we propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware, where we train a regression network on architecture-latency pairs in conjunction with a hardware-runtime descriptor to effectively estimate latency on a diverse pool of edge devices. Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters that are widely available on all Linux kernels, while still achieving up to +49.6% accuracy gains against previous state-of-the-art baseline methods on optimized edge device runtimes, using just 10 measurements from an unseen target device. We also demonstrate that, unlike MAPLE, which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes by normalizing the performance counters in the measured hardware-runtime descriptor by the operator latency. Lastly, we show that for runtimes exhibiting lower than desired accuracy, performance can be boosted by collecting additional samples from the target device, with an extra 90 samples translating to gains of nearly +40%.  ( 2 min )
    Uncertainty-Aware Prediction of Battery Energy Consumption for Hybrid Electric Vehicles. (arXiv:2204.12825v1 [cs.LG])
    The usability of vehicles is highly dependent on their energy consumption. In particular, one of the main factors hindering the mass adoption of electric (EV), hybrid (HEV), and plug-in hybrid (PHEV) vehicles is range anxiety, which occurs when a driver is uncertain about the availability of energy for a given trip. To tackle this problem, we propose a machine learning approach for modeling the battery energy consumption. By reducing predictive uncertainty, this method can help increase trust in the vehicle's performance and thus boost its usability. Most related work focuses on physical and/or chemical models of the battery that affect the energy consumption. We propose a data-driven approach which relies on real-world datasets including battery related attributes. Our approach showed an improvement in terms of predictive uncertainty as well as in accuracy compared to traditional methods.  ( 2 min )
    Adaptable Text Matching via Meta-Weight Regulator. (arXiv:2204.12668v1 [cs.IR])
    Neural text matching models have been used in a range of applications such as question answering and natural language inference, and have yielded good performance. However, these neural models have limited adaptability, resulting in a decline in performance when encountering test examples from a different dataset or even a different task. The adaptability is particularly important in the few-shot setting: in many cases, there is only a limited amount of labeled data available for a target dataset or task, while we may have access to a richly labeled source dataset or task. However, adapting a model trained on the abundant source data to a few-shot target dataset or task is challenging. To tackle this challenge, we propose a Meta-Weight Regulator (MWR), which is a meta-learning approach that learns to assign weights to the source examples based on their relevance to the target loss. Specifically, MWR first trains the model on the uniformly weighted source examples, and measures the efficacy of the model on the target examples via a loss function. By iteratively performing a (meta) gradient descent, high-order gradients are propagated to the source examples. These gradients are then used to update the weights of the source examples, in a way that is relevant to the target performance. As MWR is model-agnostic, it can be applied to any backbone neural model. Extensive experiments are conducted with various backbone text matching models, on four widely used datasets and two tasks. The results demonstrate that our proposed approach significantly outperforms a number of existing adaptation methods and effectively improves the cross-dataset and cross-task adaptability of neural text matching models in the few-shot setting.  ( 2 min )
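    A minimal single-inner-step sketch of propagating meta-gradients to example weights, using a functional linear model for brevity (the real MWR wraps an arbitrary backbone; the softmax parameterization and the learning rates are our choices):

```python
import torch
import torch.nn.functional as F

def mwr_step(W, b, weights, src_x, src_y, tgt_x, tgt_y, inner_lr=0.1, meta_lr=1.0):
    # Weighted source loss under the current per-example weights.
    per_ex = F.cross_entropy(src_x @ W + b, src_y, reduction="none")
    weighted = (torch.softmax(weights, dim=0) * per_ex).sum()
    # Differentiable (virtual) inner update of the model parameters.
    gW, gb = torch.autograd.grad(weighted, (W, b), create_graph=True)
    W_fast, b_fast = W - inner_lr * gW, b - inner_lr * gb
    # Target loss under the fast weights; high-order gradients flow back
    # through the inner update to the source-example weights.
    meta_loss = F.cross_entropy(tgt_x @ W_fast + b_fast, tgt_y)
    g_w = torch.autograd.grad(meta_loss, weights)[0]
    return (weights - meta_lr * g_w).detach().requires_grad_(True)

W = torch.zeros(16, 3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)
weights = torch.zeros(32, requires_grad=True)       # one weight per source example
src_x, src_y = torch.randn(32, 16), torch.randint(0, 3, (32,))
tgt_x, tgt_y = torch.randn(8, 16), torch.randint(0, 3, (8,))
weights = mwr_step(W, b, weights, src_x, src_y, tgt_x, tgt_y)
```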
    Topological Data Analysis for Anomaly Detection in Host-Based Logs. (arXiv:2204.12919v1 [cs.LG])
    Topological Data Analysis (TDA) gives practitioners the ability to analyse the global structure of cybersecurity data. We use TDA for anomaly detection in host-based logs collected with the open-source Logging Made Easy (LME) project. We present an approach that builds a filtration of simplicial complexes directly from Windows logs, enabling analysis of their intrinsic structure using topological tools. We compare the efficacy of persistent homology and the spectrum of graph and hypergraph Laplacians as feature vectors against a standard log embedding that counts events, and find that topological and spectral embeddings of computer logs contain discriminative information for classifying anomalous logs that is complementary to standard embeddings. We end by discussing the potential for our methods to be used as part of an explainable framework for anomaly detection.  ( 2 min )
    FlowGNN: A Dataflow Architecture for Universal Graph Neural Network Inference via Multi-Queue Streaming. (arXiv:2204.13103v1 [cs.DC])
    Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, meeting demand for novel GNN models and fast inference simultaneously is challenging because of the gap between developing efficient accelerators and the rapid creation of new GNN models. Prior art focuses on the acceleration of specific classes of GNNs, such as Graph Convolutional Network (GCN), but lacks the generality to support a wide range of existing or new GNN models. Meanwhile, most works rely on graph pre-processing to exploit data locality, making them unsuitable for real-time applications. To address these limitations, in this work, we propose a generic dataflow architecture for GNN acceleration, named FlowGNN, which can flexibly support the majority of message-passing GNNs. The contributions are three-fold. First, we propose a novel and scalable dataflow architecture, which flexibly supports a wide range of GNN models with a message-passing mechanism. The architecture features a configurable dataflow optimized for simultaneous computation of node embedding, edge embedding, and message passing, which is generally applicable to all models. We also propose a rich library of model-specific components. Second, we deliver ultra-fast real-time GNN inference without any graph pre-processing, making it agnostic to dynamically changing graph structures. Third, we verify our architecture on the Xilinx Alveo U50 FPGA board and measure the on-board end-to-end performance. We achieve a speed-up of up to 51-254x against CPU (6226R) and 1.3-477x against GPU (A6000) (with batch sizes 1 through 1024); we also outperform the SOTA GNN accelerator I-GCN by 1.03x and 1.25x across two datasets. Our implementation code and on-board measurement are publicly available on GitHub.  ( 2 min )
    Trainable Compound Activation Functions for Machine Learning. (arXiv:2204.12920v1 [cs.LG])
    Activation functions (AF) are necessary components of neural networks that allow approximation of functions, but AFs in current use are usually simple monotonically increasing functions. In this paper, we propose trainable compound AFs (TCAs) composed of a sum of shifted and scaled simple AFs. TCAs increase the effectiveness of networks while requiring fewer parameters than adding layers would. TCAs have a special interpretation in generative networks because they effectively estimate the marginal distributions of each dimension of the data using a mixture distribution, reducing modality and making linear dimension reduction more effective. When used in restricted Boltzmann machines (RBMs), they result in a novel type of RBM with mixture-based stochastic units. Improved performance is demonstrated in experiments using RBMs, deep belief networks (DBN), projected belief networks (PBN), and variational auto-encoders (VAE).  ( 2 min )
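    A sketch of one such compound unit in PyTorch, with tanh as the simple base AF and K = 3 components (the base function, K, and the initialization are our illustrative choices, not the paper's):

```python
import torch
import torch.nn as nn

class CompoundActivation(nn.Module):
    """Trainable sum of K shifted and scaled copies of a base activation."""
    def __init__(self, k=3):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(k))
        self.shift = nn.Parameter(torch.linspace(-1.0, 1.0, k))
        self.weight = nn.Parameter(torch.full((k,), 1.0 / k))

    def forward(self, x):
        z = self.scale * (x.unsqueeze(-1) - self.shift)   # (..., k) components
        return (self.weight * torch.tanh(z)).sum(dim=-1)  # weighted compound sum

act = CompoundActivation()
y = act(torch.randn(8, 64))   # drop-in replacement for a fixed activation
```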
    Spending Privacy Budget Fairly and Wisely. (arXiv:2204.12903v1 [cs.LG])
    Differentially private (DP) synthetic data generation is a practical method for improving access to data as a means to encourage productive partnerships. One issue inherent to DP is that the "privacy budget" is generally "spent" evenly across features in the data set. This leads to good statistical parity with the real data, but can undervalue the conditional probabilities and marginals that are critical for predictive quality of synthetic data. Further, loss of predictive quality may be non-uniform across the data set, with subsets that correspond to minority groups potentially suffering a higher loss. In this paper, we develop ensemble methods that distribute the privacy budget "wisely" to maximize predictive accuracy of models trained on DP data, and "fairly" to bound potential disparities in accuracy across groups and reduce inequality. Our methods are based on the insights that feature importance can inform how privacy budget is allocated, and, further, that per-group feature importance and fairness-related performance objectives can be incorporated in the allocation. These insights make our methods tunable to social contexts, allowing data owners to produce balanced synthetic data for predictive analysis.  ( 2 min )
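    By sequential composition, a total budget epsilon can simply be split across features; a toy allocation policy driven by feature importance (the uniform floor and the proportional rule are our illustration, not the paper's ensemble method) might read:

```python
import numpy as np

def allocate_budget(importances, total_eps, uniform_frac=0.2):
    """Split total_eps across features: a uniform floor plus a share
    proportional to feature importance."""
    imp = np.asarray(importances, dtype=float)
    base = uniform_frac * total_eps / len(imp)
    prop = (1.0 - uniform_frac) * total_eps * imp / imp.sum()
    return base + prop   # sums to total_eps by construction

eps_per_feature = allocate_budget([0.5, 0.3, 0.15, 0.05], total_eps=1.0)
```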
    First do no harm: counterfactual objective functions for safe & ethical AI. (arXiv:2204.12993v1 [cs.AI])
    To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. In this paper we develop the first statistical definition of harm and a framework for factoring harm into algorithmic decisions. We argue that harm is fundamentally a counterfactual quantity, and show that standard machine learning algorithms are guaranteed to pursue harmful policies in certain environments. To resolve this, we derive a family of counterfactual objective functions that robustly mitigate for harm. We demonstrate our approach with a statistical model for identifying optimal drug doses. While identifying optimal doses using the causal treatment effect results in harmful treatment decisions, our counterfactual algorithm identifies doses that are far less harmful without sacrificing efficacy. Our results show that counterfactual reasoning is a key ingredient for safe and ethical AI.  ( 2 min )
    Ollivier-Ricci Curvature For Head Pose Estimation From a Single Image. (arXiv:2204.13006v1 [cs.CV])
    Head pose estimation is a crucial challenge for many real-world applications, such as attention and human behavior analysis. This paper aims to estimate head pose from a single image by applying notions of network curvature. In the real world, many complex networks have groups of nodes that are well connected to each other with significant functional roles. Similarly, the interactions of facial landmarks can be represented as complex dynamic systems modeled by weighted graphs. The functionalities of such systems are therefore intrinsically linked to the topology and geometry of the underlying graph. In this work, using the geometric notion of Ollivier-Ricci curvature (ORC) on weighted graphs as input to the XGBoost regression model, we show that the intrinsic geometric basis of ORC offers a natural approach to discovering underlying common structure within a pool of poses. Experiments on the BIWI, AFLW2000 and Pointing'04 datasets show that the ORC_XGB method performs well compared to state-of-the-art methods, both landmark-based and image-only.
    Scalable particle-based alternatives to EM. (arXiv:2204.12965v1 [stat.CO])
    Building on (Neal and Hinton, 1998), where the problem tackled by EM is recast as the optimization of a free energy functional on an infinite-dimensional space, we obtain three practical particle-based alternatives to EM applicable to broad classes of models. All three are derived through straightforward discretizations of gradient flows associated with the functional. The novel algorithms scale well to high-dimensional settings and outperform existing state-of-the-art methods in numerical experiments.
    Epicardial Adipose Tissue Segmentation from CT Images with A Semi-3D Neural Network. (arXiv:2204.12904v1 [eess.IV])
    Epicardial adipose tissue is a type of adipose tissue located between the heart wall and a protective layer around the heart called the pericardium. The volume and thickness of epicardial adipose tissue are linked to various cardiovascular diseases. It is shown to be an independent cardiovascular disease risk factor. Fully automatic and reliable measurements of epicardial adipose tissue from CT scans could provide better disease risk assessment and enable the processing of large CT image data sets for a systemic epicardial adipose tissue study. This paper proposes a method for fully automatic semantic segmentation of epicardial adipose tissue from CT images using a deep neural network. The proposed network uses a U-Net-based architecture with slice depth information embedded in the input image to segment a pericardium region of interest, which is used to obtain an epicardial adipose tissue segmentation. Image augmentation is used to increase model robustness. Cross-validation of the proposed method yields a Dice score of 0.86 on the CT scans of 20 patients.  ( 2 min )
    On the Dynamics of Inference and Learning. (arXiv:2204.12939v1 [cond-mat.dis-nn])
    Statistical inference is the process of determining a probability distribution over the space of parameters of a model given a data set. As more data becomes available this probability distribution is updated via the application of Bayes' theorem. We present a treatment of this Bayesian updating process as a continuous dynamical system. Statistical inference is then governed by a first order differential equation describing a trajectory or flow in the information geometry determined by a parametric family of models. We solve this equation for some simple models and show that when the Cramér-Rao bound is saturated the learning rate is governed by a simple $1/T$ power-law, with $T$ a time-like variable denoting the quantity of data. The presence of hidden variables can be incorporated in this setting, leading to an additional driving term in the resulting flow equation. We illustrate this with both analytic and numerical examples based on Gaussians and Gaussian Random Processes and inference of the coupling constant in the 1D Ising model. Finally, we compare the qualitative behaviour exhibited by Bayesian flows to the training of various neural networks on benchmark data sets such as MNIST and CIFAR10, and show that for networks exhibiting small final losses the simple power-law is also satisfied.  ( 2 min )
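    As a quick heuristic for why saturating the Cramér-Rao bound yields the $1/T$ law (our gloss, not the paper's derivation): with $T$ i.i.d. observations and Fisher information $I(\theta)$, saturation gives posterior variance $\mathrm{Var}(\hat{\theta} \mid x_{1:T}) \approx 1/(T\, I(\theta))$, so each additional datum shifts the estimate by an amount of order $1/T$.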
    Improving Feature Generalizability with Multitask Learning in Class Incremental Learning. (arXiv:2204.12915v1 [cs.LG])
    Many deep learning applications, like keyword spotting, require the incorporation of new concepts (classes) over time, referred to as Class Incremental Learning (CIL). The major challenge in CIL is catastrophic forgetting, i.e., preserving as much of the old knowledge as possible while learning new tasks. Various techniques, such as regularization, knowledge distillation, and the use of exemplars, have been proposed to resolve this issue. However, prior works primarily focus on the incremental learning step, while ignoring the optimization during the base model training. We hypothesize that a more transferable and generalizable feature representation from the base model would be beneficial to incremental learning. In this work, we adopt multitask learning during base model training to improve the feature generalizability. Specifically, instead of training a single model with all the base classes, we decompose the base classes into multiple subsets and regard each of them as a task. These tasks are trained concurrently and a shared feature extractor is obtained for incremental learning. We evaluate our approach on two datasets under various configurations. The results show that our approach enhances the average incremental learning accuracy by up to 5.5%, which enables more reliable and accurate keyword spotting over time. Moreover, the proposed approach can be combined with many existing techniques and provides additional performance gain.
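    A minimal PyTorch sketch of the base-training stage as I read the abstract: the base classes are split into subsets, each subset gets its own head, and all heads share one feature extractor trained concurrently. The routing of samples to heads and the loss form are assumptions.

        import torch
        import torch.nn as nn

        class MultiTaskBase(nn.Module):
            """Shared encoder with one classification head per base-class subset."""
            def __init__(self, encoder, feat_dim, class_subsets):
                super().__init__()
                self.encoder = encoder  # shared feature extractor, kept for CIL
                self.heads = nn.ModuleList(
                    [nn.Linear(feat_dim, len(s)) for s in class_subsets])

            def forward(self, x):
                z = self.encoder(x)
                return [head(z) for head in self.heads]

        def multitask_loss(logits_per_task, labels, task_ids):
            # labels are assumed remapped to [0, |subset|) within each subset
            ce = nn.CrossEntropyLoss()
            loss = 0.0
            for t, logits in enumerate(logits_per_task):
                mask = task_ids == t
                if mask.any():
                    loss = loss + ce(logits[mask], labels[mask])
            return loss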
    Forecasting Foreign Exchange Rates With Parameter-Free Regression Networks Tuned By Bayesian Optimization. (arXiv:2204.12914v1 [q-fin.ST])
    The article is concerned with the problem of multi-step financial time series forecasting of Foreign Exchange (FX) rates. To address this problem, we introduce a parameter-free regression network termed RegPred Net. The exchange rate to forecast is treated as a stochastic process. It is assumed to follow a generalization of Brownian motion and the mean-reverting process referred to as the generalized Ornstein-Uhlenbeck (OU) process, with time-dependent coefficients. Using past observed values of the input time series, these coefficients can be regressed online by the cells of the first half of the network (Reg). The regressed coefficients depend only on - but are very sensitive to - a small number of hyperparameters required to be set by a global optimization procedure, for which Bayesian optimization is an adequate heuristic. Thanks to its multi-layered architecture, the second half of the regression network (Pred) can project time-dependent values for the OU process coefficients and generate realistic trajectories of the time series. Predictions can be easily derived in the form of expected values estimated by averaging values obtained by Monte Carlo simulation. The forecasting accuracy over a 100-day horizon is evaluated for several of the most important FX rates such as EUR/USD, EUR/CNY, and EUR/GBP. Our experimental results show that the RegPred Net significantly outperforms ARMA, ARIMA, LSTMs, and Autoencoder-LSTM models in this task.
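    The generative half of the scheme can be sketched independently of the network: once per-step OU coefficients are available, trajectories follow from Euler-Maruyama simulation and the forecast is a Monte Carlo average. The constant coefficient paths below are placeholders; in RegPred Net they would be produced by the Reg/Pred cells.

        import numpy as np

        def simulate_ou(x0, theta, mu, sigma, dt=1.0, n_paths=1000, seed=0):
            """dX_t = theta_t * (mu_t - X_t) dt + sigma_t dW_t, with per-step
            coefficient arrays theta, mu, sigma of length H (the horizon)."""
            rng = np.random.default_rng(seed)
            H = len(theta)
            X = np.full(n_paths, x0, dtype=float)
            out = np.empty((H, n_paths))
            for t in range(H):
                dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
                X = X + theta[t] * (mu[t] - X) * dt + sigma[t] * dW
                out[t] = X
            return out

        H = 100  # 100-day horizon, as in the paper's evaluation
        paths = simulate_ou(1.10, theta=np.full(H, 0.05),
                            mu=np.full(H, 1.12), sigma=np.full(H, 0.004))
        forecast = paths.mean(axis=1)  # expected value per day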
    LiftPool: Lifting-based Graph Pooling for Hierarchical Graph Representation Learning. (arXiv:2204.12881v1 [cs.LG])
    Graph pooling has been increasingly considered for graph neural networks (GNNs) to facilitate hierarchical graph representation learning. Existing graph pooling methods commonly consist of two stages, i.e., selecting the top-ranked nodes and removing the remaining nodes to construct a coarsened graph representation. However, local structural information of the removed nodes would be inevitably dropped in these methods, due to the inherent coupling of nodes (location) and their features (signals). In this paper, we propose an enhanced three-stage method via lifting, named LiftPool, to improve hierarchical graph representation by maximally preserving the local structural information in graph pooling. LiftPool introduces an additional stage of graph lifting before graph coarsening to preserve the local information of the removed nodes and decouple the processes of node removal and feature reduction. Specifically, for each node to be removed, its local information is obtained by subtracting the global information aggregated from its neighboring preserved nodes. Subsequently, this local information is aligned and propagated to the preserved nodes to alleviate information loss in graph coarsening. Furthermore, we demonstrate that the proposed LiftPool is localized and permutation-invariant. The proposed graph lifting structure is general and can be integrated with existing downsampling-based graph pooling methods. Evaluations on benchmark graph datasets show that LiftPool substantially outperforms the state-of-the-art graph pooling methods in the task of graph classification.
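    The lifting stage admits a compact sketch (my reading of the abstract, with dense adjacency and mean aggregation as simplifying assumptions): removed nodes subtract the information aggregated from their preserved neighbours, and the resulting local residuals are propagated back onto those neighbours before coarsening.

        import torch

        def lift(X, A, keep_mask):
            """X: (N, F) node features; A: (N, N) adjacency; keep_mask: (N,) bool."""
            A_rk = A[~keep_mask][:, keep_mask].float()   # removed -> kept edges
            deg_r = A_rk.sum(dim=1, keepdim=True).clamp(min=1)
            global_part = (A_rk @ X[keep_mask]) / deg_r  # aggregated from kept nodes
            local = X[~keep_mask] - global_part          # residual local information
            deg_k = A_rk.t().sum(dim=1, keepdim=True).clamp(min=1)
            # propagate the residuals back to the preserved nodes
            return X[keep_mask] + (A_rk.t() @ local) / deg_k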
    Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study. (arXiv:2204.12868v1 [stat.ML])
    This paper compares the performances of three supervised machine learning algorithms in terms of predictive ability and model interpretation on structured or tabular data. The algorithms considered were scikit-learn implementations of extreme gradient boosting machines (XGB) and random forests (RFs), and feedforward neural networks (FFNNs) from TensorFlow. The paper is organized in a findings-based manner, with each section providing general conclusions supported by empirical results from simulation studies that cover a wide range of model complexity and correlation structures among predictors. We considered both continuous and binary responses of different sample sizes. Overall, XGB and FFNNs were competitive, with FFNNs showing better performance in smooth models and tree-based boosting algorithms performing better in non-smooth models. This conclusion held generally for predictive performance, identification of important variables, and determining correct input-output relationships as measured by partial dependence plots (PDPs). FFNNs generally had less over-fitting, as measured by the difference in performance between training and testing datasets. However, the difference with XGB was often small. RFs did not perform well in general, confirming the findings in the literature. All models exhibited different degrees of bias seen in PDPs, but the bias was especially problematic for RFs. The extent of the biases varied with correlation among predictors, response type, and data set sample size. In general, tree-based models tended to over-regularize the fitted model in the tails of predictor distributions. Finally, as is to be expected, performances were better for continuous responses compared to binary data and with larger samples.
    Detecting Backdoor Poisoning Attacks on Deep Neural Networks by Heatmap Clustering. (arXiv:2204.12848v1 [cs.LG])
    Predictions made by neural networks can be fraudulently altered by so-called poisoning attacks. A special case are backdoor poisoning attacks. We study suitable detection methods and introduce a new method called Heatmap Clustering. There, we apply a $k$-means clustering algorithm on heatmaps produced by the state-of-the-art explainable AI method Layer-wise relevance propagation. The goal is to separate poisoned from un-poisoned data in the dataset. We compare this method with a similar method, called Activation Clustering, which also uses $k$-means clustering but applies it to the activations of certain hidden layers of the neural network. We test the performance of both approaches for standard backdoor poisoning attacks, label-consistent poisoning attacks and label-consistent poisoning attacks with reduced amplitude stickers. We show that Heatmap Clustering consistently performs better than Activation Clustering. However, when considering label-consistent poisoning attacks, the latter method also yields good detection performance.
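    Stripped of the LRP computation (which a library such as zennit or captum would provide), the detection step reduces to k-means with k=2 on flattened heatmaps plus a smallest-cluster heuristic. This is a bare-bones sketch under that assumption, not the authors' code.

        import numpy as np
        from sklearn.cluster import KMeans

        def heatmap_clustering(heatmaps):
            """heatmaps: (n_samples, H, W) relevance maps for one predicted class."""
            flat = heatmaps.reshape(len(heatmaps), -1)
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(flat)
            # Heuristic borrowed from activation-clustering-style defenses:
            # poisoned samples typically form the smaller of the two clusters.
            return labels == np.argmin(np.bincount(labels))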
    Accelerating Robot Learning of Contact-Rich Manipulations: A Curriculum Learning Study. (arXiv:2204.12844v1 [cs.RO])
    The Reinforcement Learning (RL) paradigm has been an essential tool for automating robotic tasks. Despite the advances in RL, it is still not widely adopted in industry due to the large and expensive amount of robot interaction with the environment that it requires. Curriculum Learning (CL) has been proposed to expedite learning. However, most research has only been evaluated in simulated environments, from video games to robotic toy tasks. This paper presents a study for accelerating robot learning of contact-rich manipulation tasks based on Curriculum Learning combined with Domain Randomization (DR). We tackle complex industrial assembly tasks with position-controlled robots, such as insertion tasks. We compare different curricula designs and sampling approaches for DR. Based on this study, we propose a method that significantly outperforms previous work, which uses DR only (no CL), with less than a fifth of the training time (samples). Results also show that even when training only in simulation with toy tasks, our method can learn policies that can be transferred to the real-world robot. The learned policies achieved success rates of up to 86\% on real-world complex industrial insertion tasks (with tolerances of $\pm 0.01~mm$) not seen during the training.
    Transfer Learning with Pre-trained Conditional Generative Models. (arXiv:2204.12833v1 [cs.LG])
    Transfer learning is crucial in training deep neural networks on new target tasks. Current transfer learning methods generally assume at least one of (i) source and target task label spaces must overlap, (ii) source datasets are available, and (iii) target network architectures are consistent with source ones. However, all these assumptions are difficult to satisfy in practical settings because the target task rarely has the same labels as the source task, access to the source dataset is restricted due to licensing and storage costs, and the target architecture is often specialized to each task. To transfer source knowledge without these assumptions, we propose a transfer learning method that uses deep generative models and is composed of the following two stages: pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL). PP trains a target architecture with a synthesized dataset by using conditional source generative models. P-SSL applies SSL algorithms to labeled target data and unlabeled pseudo samples, which are generated by cascading the source classifier and generative models to condition them with target samples. Our experimental results indicate that our method can outperform baselines of scratch training and knowledge distillation.
    Learning to Parallelize in a Shared-Memory Environment with Transformers. (arXiv:2204.12835v1 [cs.DC])
    In past years, the world has switched to many-core and multi-core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications. OpenMP is the most comprehensive API that implements such schemes, characterized by a readable interface. Nevertheless, introducing OpenMP into code is challenging due to pervasive pitfalls in management of parallel shared memory. To facilitate the performance of this task, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in ML techniques, specifically in natural language processing (NLP), to replace S2S compilers altogether. We create a database (corpus), Open-OMP, specifically for this goal. Open-OMP contains over 28,000 code snippets, half of which contain OpenMP directives while the other half do not need parallelization at all with high probability. We use the corpus to train systems to automatically classify code segments in need of parallelization, as well as suggest individual OpenMP clauses. We train several transformer models, named PragFormer, for these tasks, and show that they outperform statistically-trained baselines and automatic S2S parallelization compilers in both classifying the overall need for an OpenMP directive and the introduction of private and reduction clauses. Our source code and database are available at: https://github.com/Scientific-Computing-Lab-NRCN/PragFormer.
    Supervised Contrastive CSI Representation Learning for Massive MIMO Positioning. (arXiv:2204.12796v1 [cs.IT])
    Similarity metric is crucial for massive MIMO positioning utilizing channel state information (CSI). In this letter, we propose a novel massive MIMO CSI similarity learning method via deep convolutional neural network (DCNN) and contrastive learning. A contrastive loss function is designed considering multiple positive and negative CSI samples drawn from a training dataset. The DCNN encoder is trained using the loss so that positive samples are mapped to points close to the anchor's encoding, while encodings of negative samples are kept away from the anchor's in the representation space. Evaluation results of fingerprint-based positioning on a real-world CSI dataset show that the learned similarity metric improves positioning accuracy significantly compared with other known state-of-the-art methods.
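    A loss of this shape can be written down concisely; the sketch below is a generic multi-positive InfoNCE-style loss in PyTorch (the letter's exact formulation and sampling strategy may differ).

        import torch
        import torch.nn.functional as F

        def multi_pos_contrastive(anchor, positives, negatives, tau=0.1):
            """anchor: (D,); positives: (P, D); negatives: (N, D); all encoder
            outputs are L2-normalized so dot products are cosine similarities."""
            a = F.normalize(anchor, dim=0)
            pos = F.normalize(positives, dim=1) @ a / tau  # (P,) similarities
            neg = F.normalize(negatives, dim=1) @ a / tau  # (N,) similarities
            denom = torch.logsumexp(torch.cat([pos, neg]), dim=0)
            return -(pos - denom).mean()  # average per-positive InfoNCE terms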
    GTNet: A Tree-Based Deep Graph Learning Architecture. (arXiv:2204.12802v1 [cs.LG])
    We propose Graph Tree Networks (GTNets), a deep graph learning architecture with a new general message passing scheme that originates from the tree representation of graphs. In the tree representation, messages propagate upward from the leaf nodes to the root node, and each node preserves its initial information prior to receiving information from its child nodes (neighbors). We formulate a general propagation rule following the nature of message passing in the tree to update a node's feature by aggregating its initial feature and its neighbor nodes' updated features. Two graph representation learning models are proposed within this GTNet architecture - Graph Tree Attention Network (GTAN) and Graph Tree Convolution Network (GTCN), with experimentally demonstrated state-of-the-art performance on several popular benchmark datasets. Unlike the vanilla Graph Attention Network (GAT) and Graph Convolution Network (GCN) which have the "over-smoothing" issue, the proposed GTAN and GTCN models can go deep, as demonstrated by comprehensive experiments and rigorous theoretical analysis.
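    The stated propagation rule, a node's update aggregating its initial feature with its neighbours' already-updated features, can be sketched as a layer (dense adjacency, mean aggregation, and the two linear maps below are my assumptions, not the paper's exact parameterization).

        import torch
        import torch.nn as nn

        class GraphTreeLayer(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.w_self = nn.Linear(dim, dim)
                self.w_nbr = nn.Linear(dim, dim)

            def forward(self, x0, h, A):
                """x0: (N, F) initial features; h: (N, F) current features;
                A: (N, N) adjacency without self-loops."""
                deg = A.sum(dim=1, keepdim=True).clamp(min=1)
                nbr = (A @ h) / deg  # neighbours' updated features
                return torch.relu(self.w_self(x0) + self.w_nbr(nbr))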
    When Performance is not Enough -- A Multidisciplinary View on Clinical Decision Support. (arXiv:2204.12810v1 [cs.LG])
    Scientific publications about machine learning in healthcare are often about implementing novel methods and boosting the performance - at least from a computer science perspective. However, beyond such often short-lived improvements, much more needs to be taken into consideration if we want to arrive at sustainable progress in healthcare. What does it take to actually implement such a system, make it usable for the domain expert, and possibly bring it into practical usage? Targeted at Computer Scientists, this work presents a multidisciplinary view on machine learning in medical decision support systems and covers information technology, medical, as well as ethical aspects. Along with an implemented risk prediction system in nephrology, challenges and lessons learned in a pilot project are presented.
    Learning Green's functions associated with parabolic partial differential equations. (arXiv:2204.12789v1 [math.NA])
    Given input-output pairs from a parabolic partial differential equation (PDE) in any spatial dimension $n\geq 1$, we derive the first theoretically rigorous scheme for learning the associated Green's function $G$. Until now, rigorously learning Green's functions associated with parabolic operators has been a major challenge in the field of scientific machine learning because $G$ may not be square-integrable when $n>1$, and time-dependent PDEs have transient dynamics. By combining the hierarchical low-rank structure of $G$ together with the randomized singular value decomposition, we construct an approximant to $G$ that achieves a relative error of $\smash{\mathcal{O}(\Gamma_\epsilon^{-1/2}\epsilon)}$ in the $L^1$-norm with high probability by using at most $\smash{\mathcal{O}(\epsilon^{-\frac{n+2}{2}}\log(1/\epsilon))}$ input-output training pairs, where $\Gamma_\epsilon$ is a measure of the quality of the training dataset for learning $G$, and $\epsilon>0$ is sufficiently small. Along the way, we extend the low-rank theory of Bebendorf and Hackbusch from elliptic PDEs in dimension $1\leq n\leq 3$ to parabolic PDEs in any dimension, which shows that Green's functions associated with parabolic PDEs admit a low-rank structure on well-separated domains.
    SVD Perspectives for Augmenting DeepONet Flexibility and Interpretability. (arXiv:2204.12670v1 [cs.LG])
    Deep operator networks (DeepONets) are powerful architectures for fast and accurate emulation of complex dynamics. As their remarkable generalization capabilities are primarily enabled by their projection-based attribute, we investigate connections with low-rank techniques derived from the singular value decomposition (SVD). We demonstrate that some of the concepts behind proper orthogonal decomposition (POD)-neural networks can improve DeepONet's design and training phases. These ideas lead us to a methodology extension that we name SVD-DeepONet. Moreover, through multiple SVD analyses, we find that DeepONet inherits from its projection-based attribute strong inefficiencies in representing dynamics characterized by symmetries. Inspired by the work on shifted-POD, we develop flexDeepONet, an architecture enhancement that relies on a pre-transformation network for generating a moving reference frame and isolating the rigid components of the dynamics. In this way, the physics can be represented on a latent space free from rotations, translations, and stretches, and an accurate projection can be performed to a low-dimensional basis. In addition to flexibility and interpretability, the proposed perspectives increase DeepONet's generalization capabilities and computational efficiency. For instance, we show flexDeepONet can accurately surrogate the dynamics of 19 variables in a combustion chemistry application while relying on 95% fewer trainable parameters than the vanilla architecture. We argue that DeepONet and SVD-based methods can reciprocally benefit from each other. In particular, the flexibility of the former in leveraging multiple data sources and multifidelity knowledge in the form of both unstructured data and physics-informed constraints has the potential to greatly extend the applicability of methodologies such as POD and PCA.
    Accelerated Continuous-Time Approximate Dynamic Programming via Data-Assisted Hybrid Control. (arXiv:2204.12707v1 [math.OC])
    We introduce a new closed-loop architecture for the online solution of approximate optimal control problems in the context of continuous-time systems. Specifically, we introduce the first algorithm that incorporates dynamic momentum in actor-critic structures to control continuous-time dynamic plants with an affine structure in the input. By incorporating dynamic momentum in our algorithm, we are able to accelerate the convergence properties of the closed-loop system, achieving superior transient performance compared to traditional gradient-descent based techniques. In addition, by leveraging the existence of past recorded data with sufficiently rich information properties, we dispense with the persistence of excitation condition traditionally imposed on the regressors of the critic and the actor. Given that our continuous-time momentum-based dynamics also incorporate periodic discrete-time resets that emulate restarting techniques used in the machine learning literature, we leverage tools from hybrid dynamical systems theory to establish asymptotic stability properties for the closed-loop system. We illustrate our results with a numerical example.
    Human-Centered Prior-Guided and Task-Dependent Multi-Task Representation Learning for Action Recognition Pre-Training. (arXiv:2204.12729v1 [cs.CV])
    Recently, much progress has been made for self-supervised action recognition. Most existing approaches emphasize the contrastive relations among videos, including appearance and motion consistency. However, two main issues remain for existing pre-training methods: 1) the learned representation is neutral and not informative for a specific task; 2) multi-task learning-based pre-training sometimes leads to sub-optimal solutions due to inconsistent domains of different tasks. To address the above issues, we propose a novel action recognition pre-training framework, which exploits human-centered prior knowledge to generate more informative representations and avoids conflicts between multiple tasks by using task-dependent representations. Specifically, we distill knowledge from a human parsing model to enrich the semantic capability of the representation. In addition, we combine knowledge distillation with contrastive learning to constitute a task-dependent multi-task framework. We achieve state-of-the-art performance on two popular benchmarks for the action recognition task, i.e., UCF101 and HMDB51, verifying the effectiveness of our method.
    The Multimarginal Optimal Transport Formulation of Adversarial Multiclass Classification. (arXiv:2204.12676v1 [cs.LG])
    We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.
    A Multi-Head Convolutional Neural Network With Multi-path Attention improves Image Denoising. (arXiv:2204.12736v1 [cs.CV])
    Recently, convolutional neural networks (CNNs) and attention mechanisms have been widely used in image denoising and achieved satisfactory performance. However, previous works mostly use a single head to receive the noisy image, limiting the richness of extracted features. Therefore, a novel CNN with multiple heads (MH) named MHCNN is proposed in this paper, whose heads receive the input image rotated by different angles. MH makes MHCNN simultaneously utilize features of rotated images to remove noise. We also present a novel multi-path attention mechanism (MPA) to integrate these features effectively. Unlike previous attention mechanisms that handle pixel-level, channel-level, and patch-level features, MPA focuses on features at the image level. Experiments show MHCNN surpasses other state-of-the-art CNN models on additive white Gaussian noise (AWGN) denoising and real-world image denoising. Its peak signal-to-noise ratio (PSNR) results are higher than other networks, such as DnCNN, BRDNet, RIDNet, PAN-Net, and CSANN. It is also demonstrated that the proposed MH with the MPA mechanism can be used as a pluggable component.
    DraftRec: Personalized Draft Recommendation for Winning in Multi-Player Online Battle Arena Games. (arXiv:2204.12750v1 [cs.AI])
    This paper presents a personalized character recommendation system for Multiplayer Online Battle Arena (MOBA) games, which are considered one of the most popular online video game genres around the world. When playing MOBA games, players go through a draft stage, where they alternately select a virtual character to play. When drafting, players select characters by considering not only their character preferences, but also the synergy and competence of their team's character combination. However, the complexity of drafting induces difficulties for beginners in choosing appropriate characters based on the characters of their team while considering their own champion preferences. To alleviate this problem, we propose DraftRec, a novel hierarchical model which recommends characters by considering each player's champion preferences and the interaction between the players. DraftRec consists of two networks: the player network and the match network. The player network captures the individual player's champion preference, and the match network integrates the complex relationship between the players and their respective champions. We train and evaluate our model on a manually collected dataset of 280,000 League of Legends matches and a publicly available dataset of 50,000 Dota2 matches. Empirically, our method achieved state-of-the-art performance in character recommendation and match outcome prediction tasks. Furthermore, a comprehensive user survey confirms that DraftRec provides convincing and satisfying recommendations. Our code and dataset are available at https://github.com/dojeon-ai/DraftRec.
    Heterogeneous Ensemble Knowledge Transfer for Training Large Models in Federated Learning. (arXiv:2204.12703v1 [cs.LG])
    Federated learning (FL) enables edge-devices to collaboratively learn a model without disclosing their private data to a central aggregating server. Most existing FL algorithms require models of identical architecture to be deployed across the clients and server, making it infeasible to train large models due to clients' limited system resources. In this work, we propose a novel ensemble knowledge transfer method named Fed-ET in which small models (different in architecture) are trained on clients, and used to train a larger model at the server. Unlike in conventional ensemble learning, in FL the ensemble can be trained on clients' highly heterogeneous data. Cognizant of this property, Fed-ET uses a weighted consensus distillation scheme with diversity regularization that efficiently extracts reliable consensus from the ensemble while improving generalization by exploiting the diversity within the ensemble. We derive a generalization bound for the ensemble of weighted models trained on heterogeneous datasets that supports the intuition behind Fed-ET. Our experiments on image and language tasks show that Fed-ET significantly outperforms other state-of-the-art FL algorithms with fewer communicated parameters, and is also robust against high data-heterogeneity.
    Data-based price discrimination: information theoretic limitations and a minimax optimal strategy. (arXiv:2204.12723v1 [cs.GT])
    This paper studies the gap between the classical pricing theory and the data-based pricing theory. We focus on the problem of price discrimination with a continuum of buyer types based on a finite sample of observations. Our first set of results provides sharp lower bounds in the worst-case scenario for the discrepancy between any data-based pricing strategies and the theoretical optimal third-degree price discrimination (3PD) strategy (respectively, uniform pricing strategy) derived from the distribution (where the sample is drawn) ranging over a large class of distributions. Consequently, there is an inevitable gap between revenues based on any data-based pricing strategy and the revenue based on the theoretical optimal 3PD (respectively, uniform pricing) strategy. We then propose easy-to-implement data-based 3PD and uniform pricing strategies and show each strategy is minimax optimal in the sense that the gap between their respective revenue and the revenue based on the theoretical optimal 3PD (respectively, uniform pricing) strategy matches our worst-case lower bounds up to constant factors (that are independent of the sample size $n$). We show that 3PD strategies are revenue superior to uniform pricing strategies if and only if the sample size $n$ is large enough. In other words, if $n$ is below a threshold, uniform pricing strategies are revenue superior to 3PD strategies. We further provide upper bounds for the gaps between the welfare generated by our minimax optimal 3PD (respectively, uniform pricing) strategy and the welfare based on the theoretical optimal 3PD (respectively, uniform pricing) strategy.
    Relational Abstractions for Generalized Reinforcement Learning on Symbolic Problems. (arXiv:2204.12665v1 [cs.LG])
    Reinforcement learning in problems with symbolic state spaces is challenging due to the need for reasoning over long horizons. This paper presents a new approach that utilizes relational abstractions in conjunction with deep learning to learn a generalizable Q-function for such problems. The learned Q-function can be efficiently transferred to related problems that have different object names and object quantities, and thus, entirely different state spaces. We show that the learned generalized Q-function can be utilized for zero-shot transfer to related problems without an explicit, hand-coded curriculum. Empirical evaluations on a range of problems show that our method facilitates efficient zero-shot transfer of learned knowledge to much larger problem instances containing many objects.
    Machines of finite depth: towards a formalization of neural networks. (arXiv:2204.12786v1 [cs.LG])
    We provide a unifying framework where artificial neural networks and their architectures can be formally described as particular cases of a general mathematical construction--machines of finite depth. Unlike neural networks, machines have a precise definition, from which several properties follow naturally. Machines of finite depth are modular (they can be combined), efficiently computable and differentiable. The backward pass of a machine is again a machine and can be computed without overhead using the same procedure as the forward pass. We prove this statement theoretically and practically, via a unified implementation that generalizes several classical architectures--dense, convolutional, and recurrent neural networks with a rich shortcut structure--and their respective backpropagation rules.
    Understanding A Class of Decentralized and Federated Optimization Algorithms: A Multi-Rate Feedback Control Perspective. (arXiv:2204.12663v1 [cs.LG])
    Distributed algorithms have been playing an increasingly important role in many applications such as machine learning, signal processing, and control. Significant research efforts have been devoted to developing and analyzing new algorithms for various applications. In this work, we provide a fresh perspective to understand, analyze, and design distributed optimization algorithms. Through the lens of multi-rate feedback control, we show that a wide class of distributed algorithms, including popular decentralized/federated schemes, can be viewed as discretizing a certain continuous-time feedback control system, possibly with multiple sampling rates, such as decentralized gradient descent, gradient tracking, and federated averaging. This key observation not only allows us to develop a generic framework to analyze the convergence of the entire algorithm class; more importantly, it also leads to an interesting way of designing new distributed algorithms. We develop the theory behind our framework and provide examples to highlight how the framework can be used in practice.
    Gaussian Kernel Variance For an Adaptive Learning Method on Signals Over Graphs. (arXiv:2204.12629v1 [eess.SP])
    This paper discusses a simple yet potentially powerful algorithm, called single-kernel Gradraker (SKG), which is an adaptive learning method predicting unknown nodal values in a network using known nodal values and the network structure. We aim to find out how to configure the model when applying the algorithm. To be more specific, we focus on SKG with a Gaussian kernel and specify how to find a suitable variance for the kernel. To do so, we introduce two variables with which we can set up requirements on the variance of the Gaussian kernel to achieve (near-)optimal performance and better understand how SKG works. Our contribution is that we introduce two variables as analysis tools, illustrate how predictions will be affected under different Gaussian kernels, and provide an algorithm finding a suitable Gaussian kernel for SKG with knowledge about the training network. Simulation results on real datasets are provided.
    Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback. (arXiv:2204.12764v1 [cs.LG])
    We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen. And in each round, the algorithm observes the aggregation of losses that come from the latest $d$ rounds. Previous works focus on the oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show the non-oblivious setting incurs $\Omega(T)$ pseudo regret even when the loss sequence has bounded memory. However, we propose a wrapper algorithm which enjoys $o(T)$ policy regret on many adversarial bandit problems under the assumption that the loss sequence has bounded memory. In particular, for $K$-armed bandits and bandit convex optimization, we have an $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for $K$-armed bandits. Our lower bound works even when the loss sequence is oblivious but the delay is non-oblivious. It answers the open problem proposed in \cite{wang2021adaptive}, showing that non-oblivious delay is enough to incur $\tilde{\Omega}(T^{2/3})$ regret.
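    The feedback model itself is easy to simulate, which helps make the setting concrete; the Dirichlet split below is an arbitrary choice of mine for how a loss decomposes into its $d$ components.

        import numpy as np

        class CompositeDelayedFeedback:
            """Component i of the loss incurred at round t arrives at round t+i;
            the learner only ever sees the per-round aggregate."""
            def __init__(self, d, seed=0):
                self.d = d
                self.buf = np.zeros(d)  # buf[i]: loss mass arriving i rounds ahead
                self.rng = np.random.default_rng(seed)

            def step(self, loss):
                parts = self.rng.dirichlet(np.ones(self.d)) * loss
                self.buf += parts
                observed = self.buf[0]  # aggregate from the latest d rounds
                self.buf = np.append(self.buf[1:], 0.0)
                return observed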
    SCGC : Self-Supervised Contrastive Graph Clustering. (arXiv:2204.12656v1 [cs.LG])
    Graph clustering discovers groups or communities within networks. Deep learning methods such as autoencoders (AE) extract effective clustering and downstream representations but cannot incorporate rich structural information. While Graph Neural Networks (GNN) have shown great success in encoding graph structure, typical GNNs based on convolution or attention variants suffer from over-smoothing, noise, and heterophily, are computationally expensive, and typically require the complete graph to be present. Instead, we propose Self-Supervised Contrastive Graph Clustering (SCGC), which imposes graph-structure via contrastive loss signals to learn discriminative node representations and iteratively refined soft cluster labels. We also propose SCGC*, with a more effective, novel, Influence Augmented Contrastive (IAC) loss to fuse richer structural information, and half the original model parameters. SCGC(*) is faster with simple linear units, completely eliminates the convolutions and attention of traditional GNNs, yet efficiently incorporates structure. It is impervious to layer depth and robust to over-smoothing, incorrect edges and heterophily. It is scalable by batching, a limitation in many prior GNN models, and trivially parallelizable. We obtain significant improvements over the state-of-the-art on a wide range of benchmark graph datasets, including images, sensor data, text, and citation networks. Specifically, we gain 20% on ARI and 18% on NMI for DBLP, with an overall 55% reduction in training time and an 81% reduction in inference time. Our code is available at: https://github.com/gayanku/SCGC
    Evaluation of Self-taught Learning-based Representations for Facial Emotion Recognition. (arXiv:2204.12624v1 [cs.CV])
    This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition (FER). The idea is to create complementary representations promoting diversity by varying the autoencoders' initialization, architecture, and training data. SVM, Bagging, Random Forest, and a dynamic ensemble selection method are evaluated as final classification methods. Experimental results on Jaffe and Cohn-Kanade datasets using a leave-one-subject-out protocol show that FER methods based on the proposed diverse representations compare favorably against state-of-the-art approaches that also explore unsupervised feature learning.
    Meta-Learning Based Early Fault Detection for Rolling Bearings via Few-Shot Anomaly Detection. (arXiv:2204.12637v1 [cs.LG])
    Early fault detection (EFD) of rolling bearings can recognize slight deviation of the health states and contribute to the stability of mechanical systems. In practice, very limited target bearing data are available to conduct EFD, which makes it hard to adapt to the EFD task of new bearings. To address this problem, many transfer learning based EFD methods utilize historical data to learn transferable domain knowledge and conduct early fault detection on new target bearings. However, most existing methods only consider the distribution drift across different working conditions but ignore the difference between bearings under the same working condition, which is called Unit-to-Unit Variability (UtUV). The setting of EFD with limited target data considering UtUV can be formulated as a Few-shot Anomaly Detection task. Therefore, this paper proposes a novel EFD method based on meta-learning considering UtUV. The proposed method can learn a generic metric based on Relation Network (RN) to measure the similarity between normal data and the newly arriving target bearing data. Besides, the proposed method utilizes a health state embedding strategy to decrease false alarms. The performance of the proposed method is tested on two bearing datasets. The results show that the proposed method can detect incipient faults earlier than the baselines with lower false alarms.
    Novel Applications for VAE-based Anomaly Detection Systems. (arXiv:2204.12577v1 [cs.LG])
    The recent rise in deep learning technologies fueled innovation and boosted scientific research. Their achievements enabled new research directions for deep generative modeling (DGM), an increasingly popular approach that can create novel and unseen data, starting from a given data set. As the technology shows promising applications, many ethical issues also arise. For example, their misuse can enable disinformation campaigns and powerful phishing attempts. Research also indicates different biases affect deep learning models, leading to social issues such as misrepresentation. In this work, we formulate a novel setting to deal with similar problems, showing that a repurposed anomaly detection system effectively generates novel data while avoiding the generation of specified unwanted data. We propose Variational Auto-encoding Binary Classifiers (V-ABC): a novel model that repurposes and extends the Auto-encoding Binary Classifier (ABC) anomaly detector, using the Variational Auto-encoder (VAE). We survey the limitations of existing approaches and explore many tools to show the model's inner workings in an interpretable way. This proposal has excellent potential for generative applications: models that rely on user-generated data could automatically filter out unwanted content, such as offensive language, obscene images, and misleading information.
    Self-scalable Tanh (Stan): Faster Convergence and Better Generalization in Physics-informed Neural Networks. (arXiv:2204.12589v1 [cs.LG])
    Physics-informed Neural Networks (PINNs) are gaining attention in the engineering and scientific literature for solving a range of differential equations with applications in weather modeling, healthcare, manufacturing, and so on. Poor scalability is one of the barriers to utilizing PINNs for many real-world problems. To address this, a Self-scalable tanh (Stan) activation function is proposed for the PINNs. The proposed Stan function is smooth, non-saturating, and has a trainable parameter. During training, it can allow easy flow of gradients to compute the required derivatives and also enable systematic scaling of the input-output mapping. It is also shown theoretically that the PINN with the proposed Stan function has no spurious stationary points when using gradient descent algorithms. The proposed Stan is tested on a couple of numerical studies involving general regression problems. It is subsequently used for solving multiple forward problems, which involve second-order derivatives and multiple dimensions, and an inverse problem where the thermal diffusivity is predicted through heat conduction in a rod. The results of these case studies establish empirically that the Stan activation function can achieve better training and more accurate predictions than the state-of-the-art activation functions.
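    As I read the paper, the activation takes the form Stan(x) = tanh(x) + beta * x * tanh(x) with a trainable per-neuron beta; the module below encodes that form, with the initialization of beta to 1 being my assumption.

        import torch
        import torch.nn as nn

        class Stan(nn.Module):
            def __init__(self, num_features):
                super().__init__()
                self.beta = nn.Parameter(torch.ones(num_features))

            def forward(self, x):
                t = torch.tanh(x)
                return t + self.beta * x * t  # smooth, non-saturating, input-scaled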
    Rate-Constrained Remote Contextual Bandits. (arXiv:2204.12620v1 [cs.LG])
    We consider a rate-constrained contextual multi-armed bandit (RC-CMAB) problem, in which a group of agents are solving the same contextual multi-armed bandit (CMAB) problem. However, the contexts are observed by a remotely connected entity, i.e., the decision-maker, that updates the policy to maximize the returned rewards, and communicates the arms to be sampled by the agents to a controller over a rate-limited communications channel. This framework can be applied to personalized ad placement, whenever the content owner observes the website visitors, and hence has the context, but needs to transmit the ads to be shown to a controller that is in charge of placing the marketing content. Consequently, the rate-constrained CMAB (RC-CMAB) problem requires the study of lossy compression schemes for the policy to be employed whenever the constraint on the channel rate does not allow the uncompressed transmission of the decision-maker's intentions. We characterize the fundamental information theoretic limits of this problem by letting the number of agents go to infinity, and study the regret that can be achieved, identifying the two distinct rate regions leading to linear and sub-linear regrets respectively. We then analyze the optimal compression scheme achievable in the limit with infinite agents, when using the forward and reverse KL divergence as distortion metric. Based on this, we also propose a practical coding scheme, and provide numerical results.
    RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning. (arXiv:2204.12581v1 [cs.LG])
    Offline reinforcement learning (RL) aims to find near-optimal policies from logged data without further environment interaction. Model-based algorithms, which learn a model of the environment from the dataset and perform conservative policy optimisation within that model, have emerged as a promising approach to this problem. In this work, we present Robust Adversarial Model-Based Offline RL (RAMBO), a novel approach to model-based offline RL. To achieve conservatism, we formulate the problem as a two-player zero-sum game against an adversarial environment model. The model is trained to minimise the value function while still accurately predicting the transitions in the dataset, forcing the policy to act conservatively in areas not covered by the dataset. To approximately solve the two-player game, we alternate between optimising the policy and optimising the model adversarially. The problem formulation that we address is theoretically grounded, resulting in a PAC performance guarantee and a pessimistic value function which lower bounds the value function in the true environment. We evaluate our approach on widely studied offline RL benchmarks, and demonstrate that our approach achieves state-of-the-art performance.
    hate-alert@DravidianLangTech-ACL2022: Ensembling Multi-Modalities for Tamil TrollMeme Classification. (arXiv:2204.12587v1 [cs.MM])
    Social media platforms often act as breeding grounds for various forms of trolling or malicious content targeting users or communities. One way of trolling users is by creating memes, which in most cases unites an image with a short piece of text embedded on top of it. The situation is more complex for multilingual (e.g., Tamil) memes due to the lack of benchmark datasets and models. We explore several models to detect Troll memes in Tamil based on the shared task, "Troll Meme Classification in DravidianLangTech2022" at ACL-2022. We observe that while the text-based model MURIL performs better for Non-troll meme classification, the image-based model VGG16 performs better for Troll-meme classification. Further fusing these two modalities helps us achieve stable outcomes in both classes. Our fusion model achieved a 0.561 weighted average F1 score and ranked second in this task.
    Protein 3D structure-based neural networks highly improve the accuracy in compound-protein binding affinity prediction. (arXiv:2204.12586v1 [q-bio.BM])
    Theoretically, the accuracy of computational models in predicting compound-protein binding affinities (CPAs) could be improved by the introduction of protein 3D structure information. However, most of these models still suffer from low accuracy due to the lack of an efficient approach to encode informative protein features. The major challenge is how to combine the multi-modal information such as the residue sequence of the protein, residue atom coordinates and the torsion angles. To tackle this problem, we develop Fast Evolutional Attention and Thoroughgoing-graph Neural Networks (FeatNN) to facilitate the application of protein 3D structure information for predicting CPAs. Specifically, we establish a novel end-to-end architecture to jointly embed torsion matrix, discrete distance matrix, and sequence information of protein and extract compound features with deep graph convolution layers. In addition, a new pairwise mapping attention mechanism is introduced to comprehensively learn potential interaction information between proteins and compounds. FeatNN considerably outperforms various state-of-the-art baselines in CPA prediction with the Pearson value elevated by about 35.7%. Thus, FeatNN provides an outstanding method for highly accurate CPA prediction and facilitates high-throughput virtual screening of drug candidates.
    Learning Eco-Driving Strategies at Signalized Intersections. (arXiv:2204.12561v1 [eess.SY])
    Signalized intersections in arterial roads result in persistent vehicle idling and excess accelerations, contributing to fuel consumption and CO2 emissions. There has thus been a line of work studying eco-driving control strategies to reduce fuel consumption and emission levels at intersections. However, methods to devise effective control strategies across a variety of traffic settings remain elusive. In this paper, we propose a reinforcement learning (RL) approach to learn effective eco-driving control strategies. We analyze the potential impact of a learned strategy on fuel consumption, CO2 emission, and travel time and compare with naturalistic driving and model-based baselines. We further demonstrate the generalizability of the learned policies under mixed traffic scenarios. Simulation results indicate that scenarios with 100% penetration of connected autonomous vehicles (CAV) may yield as much as an 18% reduction in fuel consumption and a 25% reduction in CO2 emission levels while even improving travel speed by 20%. Furthermore, results indicate that even 25% CAV penetration can bring at least 50% of the total fuel and emission reduction benefits.
    SoFaiR: Single Shot Fair Representation Learning. (arXiv:2204.12556v1 [cs.LG])
    To avoid discriminatory uses of their data, organizations can learn to map them into a representation that filters out information related to sensitive attributes. However, all existing methods in fair representation learning generate a fairness-information trade-off. To achieve different points on the fairness-information plane, one must train different models. In this paper, we first demonstrate that fairness-information trade-offs are fully characterized by rate-distortion trade-offs. Then, we use this key result and propose SoFaiR, a single shot fair representation learning method that generates, with one trained model, many points on the fairness-information plane. Besides its computational saving, our single-shot approach is, to the best of our knowledge, the first fair representation learning method that explains what information is affected by changes in the fairness / distortion properties of the representation. Empirically, we find on three datasets that SoFaiR achieves similar fairness-information trade-offs as its multi-shot counterparts.
    Surrogate Assisted Evolutionary Multi-objective Optimisation applied to a Pressure Swing Adsorption system. (arXiv:2204.12585v1 [cs.NE])
    Chemical plant design and optimisation have proven challenging due to the complexity of these real-world systems. The resulting complexity translates into high computational costs for these systems' mathematical formulations and simulation models. Research has illustrated the benefits of using machine learning surrogate models as substitutes for computationally expensive models during optimisation. This paper extends recent research into optimising chemical plant design and operation. The study further explores Surrogate Assisted Genetic Algorithms (SA-GA) in more complex variants of the original plant design and optimisation problems, such as the inclusion of parallel and feedback components. The novel extension to the original algorithm proposed in this study, Surrogate Assisted NSGA-II (SA-NSGA), was tested on a popular literature case, the Pressure Swing Adsorption (PSA) system. We further provide extensive experimentation, comparing various meta-heuristic optimisation techniques and numerous machine learning models as surrogates. The results for both sets of systems illustrate the benefits of using Genetic Algorithms as an optimisation framework for complex chemical plant system design and optimisation for both single and multi-objective scenarios. We confirm that Random Forest surrogate assisted Evolutionary Algorithms can be scaled to increasingly complex chemical systems with parallel and feedback components. We further find that combining a Genetic Algorithm framework with Machine Learning Surrogate models as a substitute for long-running simulation models yields significant computational efficiency improvements: a 1.7-1.84 times speedup for the increased-complexity examples and a 2.7 times speedup for the Pressure Swing Adsorption system.
    Toward Policy Explanations for Multi-Agent Reinforcement Learning. (arXiv:2204.12568v1 [cs.AI])
    Advances in multi-agent reinforcement learning (MARL) enable sequential decision making for a range of exciting multi-agent applications such as cooperative AI and autonomous driving. Explaining agent decisions is crucial for improving system transparency, increasing user satisfaction, and facilitating human-agent collaboration. However, existing works on explainable reinforcement learning mostly focus on the single-agent setting and are not suitable for addressing challenges posed by multi-agent environments. We present novel methods to generate two types of policy explanations for MARL: (i) policy summarization about the agent cooperation and task sequence, and (ii) language explanations to answer queries about agent behavior. Experimental results on three MARL domains demonstrate the scalability of our methods. A user study shows that the generated explanations significantly improve user performance and increase subjective ratings on metrics such as user satisfaction.
    Process Knowledge-infused Learning for Suicidality Assessment on Social Media. (arXiv:2204.12560v1 [cs.AI])
    Improving the performance and natural language explanations of deep learning algorithms is a priority for adoption by humans in the real world. In several domains, such as healthcare, such technology has significant potential to reduce the burden on humans by providing quality assistance at scale. However, current methods rely on the traditional pipeline of predicting labels from data, thus completely ignoring the process and guidelines used to obtain the labels. Furthermore, post hoc explanations on the data to label prediction using explainable AI (XAI) models, while satisfactory to computer scientists, leave much to be desired by end-users due to the lack of explanations of the process in terms of human-understandable concepts. We \textit{introduce}, \textit{formalize}, and \textit{develop} a novel Artificial Intelligence (AI) paradigm -- Process Knowledge-infused Learning (PK-iL). PK-iL utilizes structured process knowledge that explicitly explains the underlying prediction process in terms that make sense to end-users. The qualitative human evaluation confirms, through an annotator agreement of 0.72, that humans can understand the explanations for the predictions. PK-iL also performs competitively with the state-of-the-art (SOTA) baselines.
    Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages. (arXiv:2204.12543v1 [cs.CL])
    Abusive language is a growing concern on many social media platforms. Repeated exposure to abusive speech has created physiological effects on the target users. Thus, the problem of abusive language should be addressed in all forms for online peace and safety. While extensive research exists in abusive speech detection, most studies focus on English. Recently, many smearing incidents have occurred in India, which provoked diverse forms of abusive speech in online space in various languages based on the geographic location. Therefore it is essential to deal with such malicious content. In this paper, to bridge the gap, we demonstrate a large-scale analysis of multilingual abusive speech in Indic languages. We examine different interlingual transfer mechanisms and observe the performance of various multilingual models for abusive speech detection for eight different Indic languages. We also experiment to show how robust these models are on adversarial attacks. Finally, we conduct an in-depth error analysis by looking into the models' misclassified posts across various settings. We have made our code and models public for other researchers.
    Double Diffusion Maps and their Latent Harmonics for Scientific Computations in Latent Space. (arXiv:2204.12536v1 [stat.ML])
    We introduce a data-driven approach to building reduced dynamical models through manifold learning; the reduced latent space is discovered using Diffusion Maps (a manifold learning technique) on time series data. A second round of Diffusion Maps on those latent coordinates allows the approximation of the reduced dynamical models. This second round enables mapping the latent space coordinates back to the full ambient space (what is called lifting); it also enables the approximation of full state functions of interest in terms of the reduced coordinates. In our work, we develop and test three different reduced numerical simulation methodologies, either through pre-tabulation in the latent space and integration on the fly or by going back and forth between the ambient space and the latent space. The data-driven latent space simulation results, based on the three different approaches, are validated through (a) the latent space observation of the full simulation through the Nystr\"om Extension formula, or through (b) lifting the reduced trajectory back to the full ambient space, via Latent Harmonics. Latent space modeling often involves additional regularization to favor certain properties of the space over others, and the mapping back to the ambient space is then constructed mostly independently from these properties; here, we use the same data-driven approach to construct the latent space and then map back to the ambient space.
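    For reference, one round of diffusion maps fits in a few lines of numpy (the paper applies a second round on the resulting latent coordinates); the kernel scale eps, the alpha=1 density normalization, and the number of retained coordinates are tuning choices of mine.

        import numpy as np

        def diffusion_maps(X, eps, n_coords=2):
            D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
            K = np.exp(-D2 / eps)
            q = K.sum(axis=1)
            K = K / np.outer(q, q)            # alpha = 1 density normalization
            d = K.sum(axis=1)
            S = K / np.sqrt(np.outer(d, d))   # symmetric conjugate of Markov matrix
            vals, vecs = np.linalg.eigh(S)
            idx = np.argsort(-vals)[1:n_coords + 1]    # drop the trivial eigenpair
            psi = vecs[:, idx] / np.sqrt(d)[:, None]   # back to Markov eigenvectors
            return vals[idx] * psi            # diffusion coordinates at time t = 1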
    Identification of feasible pathway information for c-di-GMP binding proteins in cellulose production. (arXiv:2204.12526v1 [q-bio.QM])
    In this paper, we utilize a machine learning approach to identify the significant pathways for c-di-GMP signaling proteins. The dataset involves gene counts from 12 pathways and 5 essential c-di-GMP binding domains for 1024 bacterial genomes. Two approaches, Least absolute shrinkage and selection operator (Lasso) and Random forests, have been applied for analyzing and modeling the dataset. Both approaches show that bacterial chemotaxis is the most essential pathway for c-di-GMP encoding domains. Though popular for feature selection, the strong regularization of the Lasso method fails to associate any pathway with the MshE domain. Results from the analysis may help to understand and emphasize the supporting pathways involved in bacterial cellulose production. These findings demonstrate the need for a chassis to restrict the behavior or functionality by deactivating the selective pathways in cellulose production.  ( 2 min )
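    To make the two approaches concrete, here is a hedged scikit-learn sketch with synthetic stand-ins for the dataset (the 1024-genome x 12-pathway counts and the response below are fabricated for illustration only, not the paper's data):

        import numpy as np
        from sklearn.linear_model import Lasso
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.poisson(3.0, size=(1024, 12)).astype(float)  # pathway gene counts
        y = 0.8 * X[:, 0] + rng.normal(size=1024)            # hypothetical domain signal

        lasso = Lasso(alpha=0.1).fit(X, y)
        print("Lasso coefficients:", lasso.coef_)   # exact zeros = pathways dropped by L1

        rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
        print("RF importances:", rf.feature_importances_)

    The contrast noted in the abstract shows up here too: the L1 penalty can zero out weakly associated pathways entirely, while the forest spreads small importances across them.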
    Multi stain graph fusion for multimodal integration in pathology. (arXiv:2204.12541v1 [eess.IV])
    In pathology, tissue samples are assessed using multiple staining techniques to enhance contrast in unique histologic features. In this paper, we introduce a multimodal CNN-GNN based graph fusion approach that leverages complementary information from multiple non-registered histopathology images to predict pathologic scores. We demonstrate this approach in nonalcoholic steatohepatitis (NASH) by predicting CRN fibrosis stage and NAFLD Activity Score (NAS). Primary assessment of NASH typically requires liver biopsy evaluation on two histological stains: Trichrome (TC) and hematoxylin and eosin (H&E). Our multimodal approach learns to extract complementary information from TC and H&E graphs corresponding to each stain while simultaneously learning an optimal policy to combine this information. We report up to 20% improvement in predicting fibrosis stage and NAS component grades over single-stain modeling approaches, measured by computing linearly weighted Cohen's kappa between machine-derived and pathologist consensus scores. Broadly, this paper demonstrates the value of leveraging diverse pathology images for improved ML-powered histologic assessment.  ( 2 min )
    Application of WGAN-GP in recommendation and Questioning the relevance of GAN-based approaches. (arXiv:2204.12527v1 [cs.IR])
    Many neural-based recommender systems have been proposed in recent years, and some of them use Generative Adversarial Networks (GAN) to model user-item interactions. However, the exploration of Wasserstein GAN with Gradient Penalty (WGAN-GP) for recommendation has received relatively little scrutiny. In this paper, we focus on two questions: (1) Can we successfully apply WGAN-GP to recommendation, and does this approach give an advantage compared to the best GAN models? (2) Are GAN-based recommender systems relevant? To answer the first question, we propose a recommender system based on WGAN-GP, called CFWGAN-GP, which builds on a previous model (CFGAN). We successfully applied our method to real-world datasets on the top-k recommendation task, and the empirical results show that it is competitive with state-of-the-art GAN approaches, but we found no evidence of a significant advantage of using WGAN-GP instead of the original GAN, at least from the accuracy point of view. As for the second question, we conduct a simple experiment in which we show that a well-tuned, conceptually simpler method outperforms GAN-based models by a considerable margin, questioning the use of such models.  ( 2 min )
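    For readers unfamiliar with the WGAN-GP ingredient, the gradient penalty is the main addition over a plain critic loss; a minimal PyTorch sketch (assuming vector-valued user-item interaction inputs, as in CFGAN-style recommenders; `critic` is a placeholder module, not the paper's code):

        import torch

        def gradient_penalty(critic, real, fake):
            # Penalize the critic's gradient norm (should be ~1) on random
            # interpolations between real and generated interaction vectors.
            alpha = torch.rand(real.size(0), 1, device=real.device)
            interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
            scores = critic(interp)
            grads, = torch.autograd.grad(
                outputs=scores, inputs=interp,
                grad_outputs=torch.ones_like(scores),
                create_graph=True)
            return ((grads.norm(2, dim=1) - 1) ** 2).mean()

        # critic_loss = fake_scores.mean() - real_scores.mean() + 10 * gradient_penalty(...)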
    Self-Supervised Information Bottleneck for Deep Multi-View Subspace Clustering. (arXiv:2204.12496v1 [cs.LG])
    In this paper, we explore the deep multi-view subspace clustering framework from an information-theoretic point of view. We extend the traditional information bottleneck principle to learn common information among different views in a self-supervised manner, and accordingly establish a new framework called Self-supervised Information Bottleneck based Multi-view Subspace Clustering (SIB-MSC). Inheriting the advantages of the information bottleneck, SIB-MSC can learn a latent space for each view to capture common information among the latent representations of different views by removing superfluous information from the view itself while retaining sufficient information for the latent representations of other views. In effect, the latent representation of each view provides a kind of self-supervised signal for training the latent representations of other views. Moreover, SIB-MSC attempts to learn another latent space for each view to capture the view-specific information by introducing mutual information based regularization terms, so as to further improve the performance of multi-view subspace clustering. To the best of our knowledge, this is the first work to explore information bottleneck for multi-view subspace clustering. Extensive experiments on real-world multi-view data demonstrate that our method achieves superior performance over the related state-of-the-art methods.  ( 2 min )
    Enhancing Privacy against Inversion Attacks in Federated Learning by using Mixing Gradients Strategies. (arXiv:2204.12495v1 [cs.LG])
    Federated learning reduces the risk of information leakage, but remains vulnerable to attacks. We investigate how several neural network design decisions can defend against gradient inversion attacks. We show that overlapping gradients provide numerical resistance to gradient inversion on the highly vulnerable dense layer. Specifically, we propose to leverage batching to maximise mixing of gradients by choosing an appropriate loss function and drawing identical labels. We show that otherwise it is possible to directly recover all vectors in a mini-batch without any numerical optimisation due to the de-mixing nature of the cross entropy loss. To accurately assess data recovery, we introduce an absolute variation distance (AVD) metric for information leakage in images, derived from total variation. In contrast to standard metrics, e.g. Mean Squared Error or Structural Similarity Index, AVD offers a continuous metric for extracting information in noisy images. Finally, our empirical results on information recovery from various inversion attacks and on training performance support our defense strategies. These strategies are also shown to be useful for deep convolutional neural networks such as LeNet for image recognition. We hope that this study will help guide the development of further strategies that achieve a trustworthy federation policy.  ( 2 min )
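    The paper's exact AVD formula is not given in this summary, but a total-variation-flavoured comparison can be sketched as follows (a hypothetical stand-in, assuming the metric compares local gradient magnitudes of the recovered and original images; the actual definition may differ):

        import numpy as np

        def avd_like(x, y):
            # Compare local (total-variation-style) gradient magnitudes
            # of two images; the paper's actual AVD may differ.
            def grad_mag(img):
                gx = np.abs(np.diff(img, axis=0))[:, :-1]
                gy = np.abs(np.diff(img, axis=1))[:-1, :]
                return gx + gy
            return np.mean(np.abs(grad_mag(x) - grad_mag(y)))

        a = np.random.rand(32, 32)                 # "original" image
        b = a + 0.1 * np.random.randn(32, 32)      # noisy "reconstruction"
        print(avd_like(a, b))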
    AI-Assisted Authentication: State of the Art, Taxonomy and Future Roadmap. (arXiv:2204.12492v1 [cs.CR])
    Artificial Intelligence (AI) has found applications in a variety of environments ranging from data science to cybersecurity. AI helps break through the limitations of traditional algorithms and provides more efficient and flexible methods for solving problems. In this paper, we focus on the applications of artificial intelligence in authentication, which is used in a wide range of scenarios, including facial recognition to access buildings and keystroke dynamics to unlock smartphones. With the emerging AI-assisted authentication schemes, our comprehensive survey provides a high-level overall understanding, which paves the way for future research in this area. In contrast to other relevant surveys, our research is the first of its kind to focus on the roles of AI in authentication.  ( 2 min )
    One-shot Federated Learning without Server-side Training. (arXiv:2204.12493v1 [cs.LG])
    Federated Learning (FL) has recently made significant progress as a new machine learning paradigm for privacy protection. Due to the high communication cost of traditional FL, one-shot federated learning is gaining popularity as a way to reduce communication cost between clients and the server. Most of the existing one-shot FL methods are based on Knowledge Distillation; however, distillation-based approaches require an extra training phase and depend on publicly available data sets. In this work, we consider a novel and challenging setting: performing a single round of parameter aggregation on the local models without server-side training on a public data set. In this new setting, we propose an effective algorithm for Model Aggregation via Exploring Common Harmonized Optima (MA-Echo), which iteratively updates the parameters of all local models to bring them close to a common low-loss area on the loss surface, without harming performance on their own data sets. Compared to the existing methods, MA-Echo can work well even in extremely non-identical data distribution settings where the support categories of each local model have no overlapping labels with those of the others. We conduct extensive experiments on two popular image classification data sets to compare the proposed method with existing methods and demonstrate the effectiveness of MA-Echo, which clearly outperforms the state of the art.  ( 2 min )
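    The setting is easiest to see against the naive baseline it strengthens: a single round of coordinate-wise weight averaging over the local models. A minimal PyTorch sketch of that baseline follows (this is plain averaging for contrast, not MA-Echo's harmonized update itself):

        import copy
        import torch

        def one_shot_average(local_models):
            # One communication round: average each parameter tensor
            # across clients; no server-side training or public data.
            global_model = copy.deepcopy(local_models[0])
            state = global_model.state_dict()
            for key in state:
                state[key] = torch.stack(
                    [m.state_dict()[key].float() for m in local_models]).mean(0)
            global_model.load_state_dict(state)
            return global_model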
  • Open

    BINAS: Bilinear Interpretable Neural Architecture Search. (arXiv:2110.12399v3 [cs.LG] UPDATED)
    Practical use of neural networks often involves requirements on latency, energy and memory, among others. A popular approach to find networks under such requirements is through constrained Neural Architecture Search (NAS). However, previous methods use complicated predictors for the accuracy of the network. Those predictors are hard to interpret and sensitive to many hyperparameters that must be tuned; hence, the resulting accuracy of the generated models is often harmed. In this work we resolve this by introducing Bilinear Interpretable Neural Architecture Search (BINAS), which is based on an accurate and simple bilinear formulation of both an accuracy estimator and the expected resource requirement, together with a scalable search method with theoretical guarantees. The simplicity of our proposed estimator, together with the intuitive way it is constructed, brings interpretability through many insights about the contribution of different design choices. For example, we find that in the examined search space, adding depth and width is more effective at deeper stages of the network and at the beginning of each resolution stage. Our experiments show that BINAS generates architectures comparable to or better than those of other state-of-the-art NAS methods within a reduced marginal search cost, while strictly satisfying the resource constraints.  ( 2 min )
    Tensor decomposition for learning Gaussian mixtures from moments. (arXiv:2106.00555v2 [math.AG] UPDATED)
    In data processing and machine learning, an important challenge is to recover and exploit models that can represent accurately the data. We consider the problem of recovering Gaussian mixture models from datasets. We investigate symmetric tensor decomposition methods for tackling this problem, where the tensor is built from empirical moments of the data distribution. We consider identifiable tensors, which have a unique decomposition, showing that moment tensors built from spherical Gaussian mixtures have this property. We prove that symmetric tensors with interpolation degree strictly less than half their order are identifiable and we present an algorithm, based on simple linear algebra operations, to compute their decomposition. Illustrative experiments show the impact of the tensor decomposition method for recovering Gaussian mixtures, in comparison with other state-of-the-art approaches.  ( 2 min )
    Efficient Learning of the Parameters of Non-Linear Models using Differentiable Resampling in Particle Filters. (arXiv:2111.01409v2 [stat.ML] UPDATED)
    It has been widely documented that the sampling and resampling steps in particle filters cannot be differentiated. The {\itshape reparameterisation trick} was introduced to allow the sampling step to be reformulated into a differentiable function. We extend the {\itshape reparameterisation trick} to include the stochastic input to resampling, thereby limiting the discontinuities in the gradient calculation after this step. Knowing the gradients of the prior and likelihood allows us to run particle Markov Chain Monte Carlo (p-MCMC) and use the No-U-Turn Sampler (NUTS) as the proposal when estimating parameters. We compare the Metropolis-adjusted Langevin algorithm (MALA), Hamiltonian Monte Carlo with different numbers of steps, and NUTS. We consider two state-space models and show that NUTS improves the mixing of the Markov chain and can produce more accurate results in less computational time.  ( 2 min )
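    The reparameterisation trick the paper extends works, in its standard Gaussian form, like this minimal PyTorch example (toy numbers; the paper's contribution is applying the same idea to the resampling step's stochastic input):

        import torch

        mu = torch.tensor(0.5, requires_grad=True)
        log_sigma = torch.tensor(0.0, requires_grad=True)

        eps = torch.randn(1000)              # external noise, not differentiated
        x = mu + torch.exp(log_sigma) * eps  # sample as a differentiable function

        loss = (x ** 2).mean()
        loss.backward()
        print(mu.grad, log_sigma.grad)       # gradients flow through the sampling step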
    Knowledge Transfer in Engineering Fleets: Hierarchical Bayesian Modelling for Multi-Task Learning. (arXiv:2204.12404v1 [stat.ML] CROSS LISTED)
    We propose a population-level analysis to address issues of data sparsity when building predictive models of engineering infrastructure. By sharing information between similar assets, hierarchical Bayesian modelling is used to improve the survival analysis of a truck fleet (hazard curves) and power prediction in a wind farm (power curves). In each example, a set of correlated functions are learnt over the asset fleet, in a combined inference, to learn a population model. Parameter estimation is improved when sub-fleets of assets are allowed to share correlated information at different levels in the hierarchy. In turn, groups with incomplete data automatically borrow statistical strength from those that are data-rich. The correlations can be inspected to inform which assets share information for which effect (i.e. parameter).  ( 2 min )
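    The borrowing-strength mechanism can be seen in a tiny normal-normal sketch (known variances and fabricated numbers; the paper's full hierarchical model is richer than this):

        import numpy as np

        def partial_pooling(group_means, group_sizes, tau2, sigma2):
            # Each group estimate is shrunk toward the population mean,
            # more strongly when the group has little data.
            m = np.asarray(group_means, float)
            n = np.asarray(group_sizes, float)
            pop_mean = np.average(m, weights=n)
            w = tau2 / (tau2 + sigma2 / n)     # in (0, 1): data-rich => w near 1
            return w * m + (1 - w) * pop_mean

        # the sparse second "asset" is pulled toward the fleet-level mean
        print(partial_pooling([1.0, 3.0], [500, 5], tau2=0.5, sigma2=2.0))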
    Compressed sensing of low-rank plus sparse matrices. (arXiv:2007.09457v2 [math.NA] UPDATED)
    Expressing a matrix as the sum of a low-rank matrix plus a sparse matrix is a flexible model capturing global and local features in data popularized as Robust PCA (Candes et al., 2011; Chandrasekaran et al., 2009). Compressed sensing, matrix completion, and their variants (Eldar and Kutyniok, 2012; Foucart and Rauhut, 2013) have established that data satisfying low complexity models can be efficiently measured and recovered from a number of measurements proportional to the model complexity rather than the ambient dimension. This manuscript develops similar guarantees showing that $m\times n$ matrices that can be expressed as the sum of a rank-$r$ matrix and a $s$-sparse matrix can be recovered by computationally tractable methods from $\mathcal{O}\left((r(m+n-r)+s)\log(mn/s)\right)$ linear measurements. More specifically, we establish that the low-rank plus sparse matrix set is closed provided the incoherence of the low-rank component is upper bounded as $\mu<\sqrt{mn}/(r\sqrt{s})$, and subsequently, the restricted isometry constants for the aforementioned matrices remain bounded independent of problem size provided $p/mn$, $s/p$, and $r(m+n-r)/p$ remain fixed. Additionally, we show that semidefinite programming and two hard threshold gradient descent algorithms, NIHT and NAHT, converge to the measured matrix provided the measurement operator's RIC's are sufficiently small. These results also provably solve convex and non-convex formulations of Robust PCA with the asymptotically optimal fraction of corruptions $\alpha=\mathcal{O}\left(1/(\mu r) \right)$, where $s = \alpha^2 mn$, and improve the previously best known guarantees by not requiring that the fraction of corruptions is spread in every column and row by being upper bounded by $\alpha$. Numerical experiments illustrating these results are shown for synthetic problems, dynamic-foreground/static-background separation, and multispectral imaging.  ( 2 min )
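    In the fully observed case (no measurement operator), the hard-thresholding style of NIHT/NAHT reduces to alternating projections onto the rank-$r$ and $s$-sparse sets; a simplified NumPy sketch under that assumption (not the paper's algorithm as stated):

        import numpy as np

        def low_rank_plus_sparse(Y, r, s, n_iter=50):
            # Alternately project onto rank-r matrices (truncated SVD)
            # and s-sparse matrices (keep the s largest residual entries).
            L = np.zeros_like(Y)
            S = np.zeros_like(Y)
            for _ in range(n_iter):
                U, sv, Vt = np.linalg.svd(Y - S, full_matrices=False)
                L = (U[:, :r] * sv[:r]) @ Vt[:r]
                R = Y - L
                thresh = np.sort(np.abs(R), axis=None)[-s]
                S = np.where(np.abs(R) >= thresh, R, 0.0)
            return L, S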
    Differentially Quantized Gradient Methods. (arXiv:2002.02508v4 [cs.LG] UPDATED)
    Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients while a server decides when to stop iterative computation based on its target accuracy or delay constraints. The server receives all its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call Differential Quantization (DQ) that prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$. Thus at any $R\geq\log_2(\rho_n/\sigma_{\mathrm{GD}})$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show that no algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. The principle of differential quantization continues to apply to gradient methods with momentum such as Nesterov's accelerated gradient descent, and Polyak's heavy ball method. For these algorithms as well, if the rate is above a certain threshold, there is no loss in contraction factor obtained by the differentially quantized algorithm compared to its unquantized counterpart. Experimental results on least-squares problems validate our theoretical analysis.  ( 3 min )
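    The error-compensation principle behind DQ is compact enough to sketch on a scalar problem (a crude uniform quantizer and toy step sizes are assumed; this illustrates the idea, not the paper's exact scheme):

        import numpy as np

        def quantize(v, step=0.5):
            return step * np.round(v / step)     # coarse uniform quantizer

        def dq_gd(grad_fn, x0, lr=0.1, n_iter=200):
            x, err = np.asarray(x0, float), 0.0
            for _ in range(n_iter):
                u = grad_fn(x) + err   # add back the carried-over quantization error
                q = quantize(u)        # what actually crosses the rate-limited channel
                err = u - q            # remember the new error for the next round
                x = x - lr * q
            return x

        print(dq_gd(lambda x: 2 * (x - 3.0), x0=[0.0]))  # minimizes (x - 3)^2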
    Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for Imbalanced Learning. (arXiv:2204.08735v2 [cs.LG] UPDATED)
    Class imbalance distribution widely exists in real-world engineering. However, the mainstream optimization algorithms that seek to minimize error will trap the deep learning model in sub-optima when facing extreme class imbalance. This seriously harms classification precision, especially on the minority classes. The essential reason is that the gradients of the classifier weights are imbalanced among the components from different classes. In this paper, we propose Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients. We perform experiments on large-scale classification and segmentation datasets, and our ARB-Loss can achieve state-of-the-art performance via only one-stage training instead of the two-stage learning used by current SOTA works.  ( 2 min )
    The Multimarginal Optimal Transport Formulation of Adversarial Multiclass Classification. (arXiv:2204.12676v1 [cs.LG])
    We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.
    Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback. (arXiv:2204.12764v1 [cs.LG])
    We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen. And in each round, the algorithm observes the aggregation of losses that come from the latest $d$ rounds. Previous works focus on the oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show that the non-oblivious setting incurs $\Omega(T)$ pseudo regret even when the loss sequence has bounded memory. However, we propose a wrapper algorithm which enjoys $o(T)$ policy regret on many adversarial bandit problems under the assumption that the loss sequence has bounded memory. In particular, for $K$-armed bandits and bandit convex optimization, we obtain an $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for $K$-armed bandits. Our lower bound works even when the loss sequence is oblivious but the delay is non-oblivious. It answers the open problem proposed in \cite{wang2021adaptive}, showing that non-oblivious delay is enough to incur $\tilde{\Omega}(T^{2/3})$ regret.
    Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study. (arXiv:2204.12868v1 [stat.ML])
    This paper compares the performances of three supervised machine learning algorithms in terms of predictive ability and model interpretation on structured or tabular data. The algorithms considered were scikit-learn implementations of extreme gradient boosting machines (XGB) and random forests (RFs), and feedforward neural networks (FFNNs) from TensorFlow. The paper is organized in a findings-based manner, with each section providing general conclusions supported by empirical results from simulation studies that cover a wide range of model complexity and correlation structures among predictors. We considered both continuous and binary responses of different sample sizes. Overall, XGB and FFNNs were competitive, with FFNNs showing better performance in smooth models and tree-based boosting algorithms performing better in non-smooth models. This conclusion held generally for predictive performance, identification of important variables, and determining correct input-output relationships as measured by partial dependence plots (PDPs). FFNNs generally had less over-fitting, as measured by the difference in performance between training and testing datasets. However, the difference with XGB was often small. RFs did not perform well in general, confirming the findings in the literature. All models exhibited different degrees of bias seen in PDPs, but the bias was especially problematic for RFs. The extent of the biases varied with correlation among predictors, response type, and data set sample size. In general, tree-based models tended to over-regularize the fitted model in the tails of predictor distributions. Finally, as is to be expected, performances were better for continuous responses compared to binary data and with larger samples.
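    Since PDP bias is central to the comparison, here is the brute-force partial dependence computation such plots are based on (model and data names below are placeholders, not the paper's code):

        import numpy as np

        def partial_dependence_1d(model, X, feature, grid):
            # Fix one feature at each grid value and average predictions
            # over the data; this is the curve a PDP plots.
            pdp = []
            for v in grid:
                Xv = X.copy()
                Xv[:, feature] = v
                pdp.append(model.predict(Xv).mean())
            return np.array(pdp)

        # e.g., with a fitted rf_model and feature matrix X (hypothetical):
        # grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 30)
        # pdp_rf = partial_dependence_1d(rf_model, X, feature=0, grid=grid)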
    Transfer Learning with Pre-trained Conditional Generative Models. (arXiv:2204.12833v1 [cs.LG])
    Transfer learning is crucial in training deep neural networks on new target tasks. Current transfer learning methods generally assume at least one of the following: (i) source and target task label spaces must overlap, (ii) source datasets are available, and (iii) target network architectures are consistent with source ones. However, all these assumptions are difficult to satisfy in practical settings because the target task rarely has the same labels as the source task, access to the source dataset is restricted due to licensing and storage costs, and the target architecture is often specialized to each task. To transfer source knowledge without these assumptions, we propose a transfer learning method that uses deep generative models and is composed of the following two stages: pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL). PP trains a target architecture with a synthesized dataset by using conditional source generative models. P-SSL applies SSL algorithms to labeled target data and unlabeled pseudo samples, which are generated by cascading the source classifier and generative models to condition them with target samples. Our experimental results indicate that our method can outperform baselines of scratch training and knowledge distillation.
    An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU Gate. (arXiv:2204.12554v1 [cs.LG])
    A particular direction of recent advances in stochastic deep-learning algorithms has been uncovering a rather mysterious heavy-tailed nature of the stationary distribution of these algorithms, even when the data distribution is not heavy-tailed. Moreover, the heavy-tail index is known to show interesting dependence on the input dimension of the net, the mini-batch size and the step size of the algorithm. In this short note, we undertake an experimental study of this index for S.G.D. while training a ReLU gate (in the realizable and in the binary classification setup) and for a variant of S.G.D. that was proven to converge in Karmakar and Mukherjee (2022) for ReLU realizable data. From our experiments we conjecture that these two algorithms have similar heavy-tail behaviour on any data where the latter can be proven to converge. Secondly, we demonstrate that the heavy-tail index of the late time iterates in this model scenario has strikingly different properties than either what has been proven for linear hypothesis classes or what has been previously demonstrated for large nets.
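    One standard way to measure such a heavy-tail index is the Hill estimator (the note's exact estimator may differ; the cutoff k below is an assumed choice):

        import numpy as np

        def hill_index(samples, k=200):
            # Hill estimator from the k largest magnitudes;
            # smaller values indicate heavier tails.
            x = np.sort(np.abs(np.asarray(samples)))[::-1]
            return 1.0 / np.log(x[:k] / x[k]).mean()

        # sanity check: Cauchy samples have tail index ~1
        print(hill_index(np.random.standard_cauchy(100_000)))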
    Variational Kalman Filtering with Hinf-Based Correction for Robust Bayesian Learning in High Dimensions. (arXiv:2204.13089v1 [stat.ML])
    In this paper, we address the problem of convergence of the sequential variational inference filter (VIF) through the application of a robust variational objective and Hinf-norm based correction for a linear Gaussian system. As the dimension of state or parameter space grows, performing the full Kalman update with the dense covariance matrix for a large-scale system requires increased storage and computational complexity, making it impractical. The VIF approach, based on mean-field Gaussian variational inference, reduces this burden through the variational approximation to the covariance, usually in the form of a diagonal covariance approximation. The challenge is to retain convergence and correct for biases introduced by the sequential VIF steps. We desire a framework that improves feasibility while still maintaining reasonable proximity to the optimal Kalman filter as data is assimilated. To accomplish this goal, a Hinf-norm based optimization perturbs the VIF covariance matrix to improve robustness. This yields a novel VIF-Hinf recursion that employs consecutive variational inference and Hinf based optimization steps. We explore the development of this method and investigate a numerical example to illustrate the effectiveness of the proposed filter.  ( 2 min )
    Accurate inference of crowdsourcing properties when using efficient allocation strategies. (arXiv:1903.03104v2 [cs.LG] UPDATED)
    Allocation strategies improve the efficiency of crowdsourcing by decreasing the work needed to complete individual tasks accurately. However, these algorithms introduce bias by preferentially allocating workers onto easy tasks, leading to sets of completed tasks that are no longer representative of all tasks. This bias challenges inference of problem-wide properties such as typical task difficulty or crowd properties such as worker completion times, important information that goes beyond the crowd responses themselves. Here we study inference about problem properties when using an allocation algorithm to improve crowd efficiency. We introduce Decision-Explicit Probability Sampling (DEPS), a novel method to perform inference of problem properties while accounting for the potential bias introduced by an allocation strategy. Experiments on real and synthetic crowdsourcing data show that DEPS outperforms baseline inference methods while still leveraging the efficiency gains of the allocation method. The ability to perform accurate inference of general properties when using non-representative data allows crowdsourcers to extract more knowledge out of a given crowdsourced dataset.  ( 2 min )
    Faster online calibration without randomization: interval forecasts and the power of two choices. (arXiv:2204.13087v1 [cs.LG])
    We study the problem of making calibrated probabilistic forecasts for a binary sequence generated by an adversarial nature. Following the seminal paper of Foster and Vohra (1998), nature is often modeled as an adaptive adversary who sees all activity of the forecaster except the randomization that the forecaster may deploy. A number of papers have proposed randomized forecasting strategies that achieve an $\epsilon$-calibration error rate of $O(1/\sqrt{T})$, which we prove is tight in general. On the other hand, it is well known that it is not possible to be calibrated without randomization, or if nature also sees the forecaster's randomization; in both cases the calibration error could be $\Omega(1)$. Inspired by the equally seminal works on the "power of two choices" and imprecise probability theory, we study a small variant of the standard online calibration problem. The adversary gives the forecaster the option of making two nearby probabilistic forecasts, or equivalently an interval forecast of small width, and the endpoint closest to the revealed outcome is used to judge calibration. This power of two choices, or imprecise forecast, accords the forecaster with significant power -- we show that a faster $\epsilon$-calibration rate of $O(1/T)$ can be achieved even without deploying any randomization.  ( 2 min )
    Scalable particle-based alternatives to EM. (arXiv:2204.12965v1 [stat.CO])
    Building on (Neal and Hinton, 1998), where the problem tackled by EM is recast as the optimization of a free energy functional on an infinite-dimensional space, we obtain three practical particle-based alternatives to EM applicable to broad classes of models. All three are derived through straightforward discretizations of gradient flows associated with the functional. The novel algorithms scale well to high-dimensional settings and outperform existing state-of-the-art methods in numerical experiments.  ( 2 min )
    Closing the Gap between Single-User and Multi-User VoiceFilter-Lite. (arXiv:2202.12169v2 [eess.AS] UPDATED)
    VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.  ( 2 min )
    IH-GAN: A Conditional Generative Model for Implicit Surface-Based Inverse Design of Cellular Structures. (arXiv:2103.02588v4 [cs.CE] UPDATED)
    Variable-density cellular structures can overcome connectivity and manufacturability issues of topologically optimized structures, particularly those represented as discrete density maps. However, the optimization of such cellular structures is challenging due to the multiscale design problem. Past work addressing this problem generally either only optimizes the volume fraction of single-type unit cells but ignores the effects of unit cell geometry on properties, or considers the geometry-property relation but builds this relation via heuristics. In contrast, we propose a simple yet more principled way to accurately model the property to geometry mapping using a conditional deep generative model, named Inverse Homogenization Generative Adversarial Network (IH-GAN). It learns the conditional distribution of unit cell geometries given properties and can realize the one-to-many mapping from properties to geometries. We further reduce the complexity of IH-GAN by using the implicit function parameterization to represent unit cell geometries. Results show that our method can 1) generate various unit cells that satisfy given material properties with high accuracy ($R^2$-scores between target properties and properties of generated unit cells $>98\%$) and 2) improve the optimized structural performance over the conventional variable-density single-type structure. In the minimum compliance example, our IH-GAN generated structure achieves a $79.7\%$ reduction in concentrated stress and an extra $3.03\%$ reduction in displacement. In the target deformation examples, our IH-GAN generated structure reduces the target matching error by $86.4\%$ and $79.6\%$ for two test cases, respectively. We also demonstrated that the connectivity issue for multi-type unit cells can be solved by transition layer blending.  ( 3 min )
    Forecasting Foreign Exchange Rates With Parameter-Free Regression Networks Tuned By Bayesian Optimization. (arXiv:2204.12914v1 [q-fin.ST])
    The article is concerned with the problem of multi-step financial time series forecasting of Foreign Exchange (FX) rates. To address this problem, we introduce a parameter-free regression network termed RegPred Net. The exchange rate to forecast is treated as a stochastic process. It is assumed to follow a generalization of Brownian motion and of the mean-reverting process, referred to as the generalized Ornstein-Uhlenbeck (OU) process, with time-dependent coefficients. Using past observed values of the input time series, these coefficients can be regressed online by the cells of the first half of the network (Reg). The regressed coefficients depend only on - but are very sensitive to - a small number of hyperparameters required to be set by a global optimization procedure, for which Bayesian optimization is an adequate heuristic. Thanks to its multi-layered architecture, the second half of the regression network (Pred) can project time-dependent values for the OU process coefficients and generate realistic trajectories of the time series. Predictions can be easily derived in the form of expected values estimated by averaging values obtained by Monte Carlo simulation. The forecasting accuracy on a 100-day horizon is evaluated for several of the most important FX rates such as EUR/USD, EUR/CNY, and EUR/GBP. Our experimental results show that the RegPred Net significantly outperforms ARMA, ARIMA, LSTMs, and Autoencoder-LSTM models in this task.  ( 2 min )
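    The generative half of the pipeline amounts to Monte Carlo simulation of an OU process with time-dependent coefficients; a minimal Euler-Maruyama sketch (constant toy coefficients stand in for the network's regressed, time-dependent ones):

        import numpy as np

        def simulate_ou(theta, mu, sigma, x0, n_steps=100, dt=1.0, n_paths=10_000):
            # dX_t = theta(t) * (mu(t) - X_t) dt + sigma(t) dW_t
            rng = np.random.default_rng(0)
            X = np.full(n_paths, float(x0))
            paths = [X.copy()]
            for t in range(n_steps):
                dW = rng.normal(0.0, np.sqrt(dt), n_paths)
                X = X + theta(t) * (mu(t) - X) * dt + sigma(t) * dW
                paths.append(X.copy())
            return np.array(paths)              # (n_steps + 1, n_paths)

        paths = simulate_ou(lambda t: 0.05, lambda t: 1.10, lambda t: 0.01, x0=1.08)
        forecast = paths.mean(axis=1)           # expected rate at each horizon step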
    On the Dynamics of Inference and Learning. (arXiv:2204.12939v1 [cond-mat.dis-nn])
    Statistical Inference is the process of determining a probability distribution over the space of parameters of a model given a data set. As more data becomes available this probability distribution becomes updated via the application of Bayes' theorem. We present a treatment of this Bayesian updating process as a continuous dynamical system. Statistical inference is then governed by a first order differential equation describing a trajectory or flow in the information geometry determined by a parametric family of models. We solve this equation for some simple models and show that when the Cram\'{e}r-Rao bound is saturated the learning rate is governed by a simple $1/T$ power-law, with $T$ a time-like variable denoting the quantity of data. The presence of hidden variables can be incorporated in this setting, leading to an additional driving term in the resulting flow equation. We illustrate this with both analytic and numerical examples based on Gaussians and Gaussian Random Processes and inference of the coupling constant in the 1D Ising model. Finally, we compare the qualitative behaviour exhibited by Bayesian flows to the training of various neural networks on benchmarked data sets such as MNIST and CIFAR10 and show that for networks exhibiting small final losses the simple power-law is also satisfied.  ( 2 min )
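    The 1/T law already shows up in the simplest conjugate case, estimating a Gaussian mean with known noise variance (toy numbers):

        # Posterior variance after T observations:
        # 1 / (1/prior_var + T/s2) ~ s2 / T for large T,
        # i.e. the information gained per datum decays like 1/T.
        s2, prior_var = 1.0, 100.0
        for T in [1, 10, 100, 1000]:
            post_var = 1.0 / (1.0 / prior_var + T / s2)
            print(T, post_var)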
    First do no harm: counterfactual objective functions for safe & ethical AI. (arXiv:2204.12993v1 [cs.AI])
    To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. In this paper we develop the first statistical definition of harm and a framework for factoring harm into algorithmic decisions. We argue that harm is fundamentally a counterfactual quantity, and show that standard machine learning algorithms are guaranteed to pursue harmful policies in certain environments. To resolve this, we derive a family of counterfactual objective functions that robustly mitigate harm. We demonstrate our approach with a statistical model for identifying optimal drug doses. While identifying optimal doses using the causal treatment effect results in harmful treatment decisions, our counterfactual algorithm identifies doses that are far less harmful without sacrificing efficacy. Our results show that counterfactual reasoning is a key ingredient for safe and ethical AI.  ( 2 min )

  • Open

    [P] Auto-encoder image dimension error
    Hello, I have a problem regarding my auto-encoder model. I built it so that it gives an abnormally high MSE when presented with an image too different from what it knows. I have this little function at the end of my program to predict whether or not the image presented is an anomaly:

        def check_anomaly(img_path):
            density_threshold = 2500  # Set this value based on the above exercise
            reconstruction_error_threshold = 0.004  # Set this value based on the above exercise
            img = Image.open(img_path)
            img = np.array(img.resize((128, 128), Image.ANTIALIAS))
            plt.imshow(img)
            img = img / 255.
            img = img[np.newaxis, :, :, :]
            encoded_img = encoder_model.predict([[img]])
            encoded_img = [np.reshape(img, (out_vector_shape)) for img in encoded_img]
            density = kde.score_samples(encoded_img)[0]
            reconstruc…  ( 1 min )
    What's the actual difference in simple terms between Cloudera Data Science Workbench, DataRobot, and DataBricks? [Discussion]
    I've also noticed Cloudera and DataRobot seem to follow per-user pricing, whereas DataBricks follows the standard hourly machine rate. submitted by /u/SmarterChild8675309
    [Project] Multi-Bounding box identification
    I have a data set of satellite images of aircraft. For the labels of these images I have the bounding boxes for all aircraft in that image. I am looking for some advice on where to get started with this. I need to estimate all bounding boxes for these aircraft; any suggestions? submitted by /u/The_Dov
    [D] NLP: anyone familiar with taxonomy extraction and evaluation?
    Hi all, does anyone have experience with automatic taxonomy/ontology extraction from an unlabelled corpus, especially how to evaluate the extracted structures without gold standards? Most published papers invite students/researchers to conduct manual reviews, thus making it very difficult to compare the results. Thanks in advance. submitted by /u/CestLucas
    [D] Precision-Recall curve with best F1 Score of 0.56
    I am evaluating the performance of a model. I get different Precision-Recall curves for different configurations and the best F1 score (0.56) corresponds to the red point. Would you consider this performance acceptable? I know there is a lot of room for improvement but is it okay or is it extremely poor performance? https://preview.redd.it/k5qihccdb5w81.png?width=784&format=png&auto=webp&s=a0f3408f2ff671ee491faa50752b6c8e0cf17e4c Also, if I understood it correctly, each point of the Precision-Recall curve has its own F1 Score, right? Thank you! submitted by /u/SeaResponsibility176  ( 1 min )
    [P] surgeon-pytorch – a small library to inspect intermediate layers of PyTorch models
    I've made surgeon https://github.com/archinetai/surgeon-pytorch, a small library to inspect the intermediate output layers of PyTorch models without changing the original implementation. This can be very useful if you are using pre-trained models (e.g. from Huggingface or torch.hub) and want to get embeddings, attention matrices, or simply debug the model without adding additional code – which is often hard to do without changing the implementation. I hope this can be helpful to anyone! submitted by /u/Aglitter  ( 1 min )
    [D] Thoughts on the AI4 conference?
    Lately, I have been receiving many messages from the organizers of this conference saying they have some free passes to attend it: https://ai4.io/usa/ While the speaker line-up from the industry appears impressive, I have not heard of this conference before. Any thoughts on how legit/good this conference is? submitted by /u/roalddahl14  ( 1 min )
    [D] Has anyone attempted to recreate the work described in the Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry Paper?
    The D3VO paper from the Tech. University of Munich looks quite promising, however that Uni does not seem very keen on publishing code along with their papers. Has anyone already tried to reproduce their work? What kind of results did you see in your version? What hardware did you run it on? How difficult was it to reproduce their models in PyTorch? submitted by /u/autojazari  ( 1 min )
    [D] - Are GFlow Nets considered diffusion models?
    I stumbled upon GFlow Nets and, in my opinion, they look very similar to diffusion models. There is a touch of RL in GFlow Nets but the main idea is very similar to diffusion models. Is that right? Or am I missing something? submitted by /u/dimem16
    [D] How is it checked if models do not just memorize their training examples?
    Hello everyone, this post is about generative models (e.g. score-based generative models, GANs, etc.) on leaderboards like this: https://paperswithcode.com/sota/image-generation-on-cifar-10 How do they check that the models do not just memorize the training examples? The FID score would be optimal if you just generated the training examples again. Best submitted by /u/future_gcp_poweruser  ( 5 min )
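    One common sanity check (a frequently used diagnostic, not necessarily the leaderboards' protocol) is to look at each generated sample's distance to its nearest training example; a pile of near-zero distances suggests memorization. A minimal NumPy sketch:

        import numpy as np

        def nearest_train_distances(generated, train):
            # Pixel-space nearest-neighbour distances; in practice this is
            # often done in a feature space rather than raw pixels.
            G = generated.reshape(len(generated), -1).astype(np.float64)
            X = train.reshape(len(train), -1).astype(np.float64)
            d2 = (G**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * G @ X.T
            return np.sqrt(np.maximum(d2, 0.0).min(axis=1))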
    [D] ICLR 2022 blog post track
    It seems like the list of accepted blog posts has been published for the ICLR 2022 blog post track (https://iclr-blog-track.github.io/). They also invited Karpathy to publish his most recent blog post on the history and future of convnets. What do you think about the blog posts? Any blog posts that you particularly like or that are definitely worth reading? Any blog posts that actually have interesting contributions and/or that you plan to cite? But maybe more importantly: how do you think this will evolve? They seem to have decided to organise the blog post track again for next year's ICLR already. Do you think these kinds of publications could have an impact on how we do science or is this rather a nice extra on top of scientific work? There has only been one post in this subreddit on one of the accepted blog posts (since the announcement). I think there are some nice blog posts in there, so I expected to find some discussions here, but it seems like it is either ignored or not worth discussing. Therefore, I thought it would be interesting to (try to) start a discussion. TL;DR: what do you think of the ICLR blog post track and/or the accepted blog posts? submitted by /u/mr_tsjolder  ( 1 min )
    [D] MLOps tools for automatic fine tuning of deployed machine learning models
    I'm working on an ML model for data extraction from documents. The model is trained and deployed into production. For fine-tuning the model and improving performance I added the possibility for users to correct the extracted data. A concrete example would be: the model labels a word in the document as "company_name"; the user corrects it to "street_name". This correction is then used to fine-tune the model. Currently the fine-tuning is done manually. That is, if the number of corrections exceeds some threshold, I take them, start a new training run, and evaluate the new model before putting it into production. My question would be: is there an MLOps tool that automates this process? Or should I write one myself? I am aware of the tool Seldon Core, which offers A/B testing for comparing the old model with the new one and putting it into production. But unfortunately it does not offer automatic fine-tuning. Or that's what I understood from their website. submitted by /u/alzoubi36  ( 1 min )
    [P] Create interactive slides for Machine Learning models from Jupyter Notebook
    Would you like to create an interactive presentation for your ML model directly from a Jupyter Notebook? I'm working on an open-source project for converting notebooks into interactive documents. Recently, I've added the option to turn notebooks into interactive slides. You can showcase your Machine Learning model as an interactive presentation. During the presentation, the user can change values and recompute the slide! I've created a demo presentation where I use Random Forest to predict Iris species. A screen recording of the presentation: https://github.com/pplonski/ml-model-slides/raw/main/media/slides-from-ml-model.gif The presentation is available online (deployed on Heroku): https://ml-model-presentation.herokuapp.com/ What is more, the presentation code is on GitHub: https://github.com/pplonski/ml-model-slides (yes, code is presentation! bye-bye PPT) In case anyone is interested, the framework is called Mercury. submitted by /u/pp314159  ( 1 min )
  • Open

    [2202.12742] Learning Relative Return Policies With Upside-Down Reinforcement Learning
    submitted by /u/chimp73
    What does it mean to centralise the observation in MARL?
    submitted by /u/No_Possibility_7588  ( 1 min )
    Off-policy algorithm with batch of actions
    Hello everyone, first of all sorry for my poor English. I am fairly new to Deep RL, and RL in general. I want to implement a custom environment with multiple robots in it. Each robot would be given the same task to do, so what I am trying to do is simulate parallelism. My question is: if I have 100 robots in the simulation, do I have to instantiate 100 agents (neural networks) to control those robots, or will a single network with a batch size of 100 suffice? Does a batch of observations in any way disturb the agent's action (network) output? It would be much more memory efficient with a single neural network. So far, I've seen that in the process of acting in the environment the agent takes an observation of batch size 1 and outputs a corresponding action for that observation. Since each observation from the batch is propagated through the network without calculating gradients, it should not be affected by the batch size. Could anyone explain to me if my way of thinking is wrong, and why? :) submitted by /u/Dexter_fixxor  ( 1 min )
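    A single shared network with a batch dimension is the usual answer; here is a minimal PyTorch sketch of the idea (toy dimensions, assuming the robot observations can be stacked into one tensor):

        import torch
        import torch.nn as nn

        obs_dim, act_dim, n_robots = 16, 4, 100

        # one policy network controls all robots
        policy = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim))

        obs_batch = torch.randn(n_robots, obs_dim)  # one observation per robot
        with torch.no_grad():                       # acting, not training
            actions = policy(obs_batch)             # (100, act_dim)

    Rows of a batch are processed independently by fully connected layers, so one network with batch size 100 behaves exactly like 100 weight-shared copies: robot i's action depends only on observation i.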
    "NeuPL: Neural Population Learning", Liu et al 2022 (encoding PBT agents into a single multi-policy agent)
    submitted by /u/gwern  ( 1 min )
  • Open

    When will humorous AIs press our buttons with their jokes? | Psyche Ideas
    submitted by /u/estasfuera
    Best AI for blog post creation? Or what tools do you use to get an outline faster?
    I've only ever used writesonic.com; I got premium account access for free. I love it for when you need to quickly spin up an article when I'm on a backlink-building spree. Haven't tested it against others, but it's good overall, and improvements are being made to it all the time (it's in active development). submitted by /u/CliffWoolum  ( 1 min )
    Showcase your ML model in a python Web GUI
    submitted by /u/Illustrious_Row_9971
    John Deere is becoming one of the world's most important AI companies
    submitted by /u/estasfuera  ( 1 min )
    I keep getting terrible advice from AI assistants. Help me rate the worst and the best responses :)
    I was playing around with several AI models and this sort of stuff happens all the time. https://preview.redd.it/6arq56nih3w81.png?width=1163&format=png&auto=webp&s=5ca68dce977ae671b0b928c2f1ec3cdd8478257f I decided to prepare a compilation and rank the answers. Can you help me decide how bad or good some of the answers are by taking a survey here? Edit: I apologize if some of you felt tricked into completing the survey by the original version of the post. It does have about 50 questions but they are mostly just yes/no, good/bad, so it shouldn't take longer than 10 minutes. I intend to write a piece about how AI assistants are doing and prepare a compilation of AI fails. I will share the results :) submitted by /u/KazRainer  ( 1 min )
    Is there a dataset for personal items?
    Hi! I'm looking for a dataset containing images of personal items (wallet, keys, phone, etc.), annotated with bounding boxes. Can't seem to find anything; does anyone know of such a dataset? Thanks in advance! submitted by /u/ifinty  ( 1 min )
    Can I get AI language bots that have already been trained?
    I'm brand new to programming. I am thinking I need a bot to skim through PDFs and, through trial and error, I can take the PDFs and make the bot come up with suggestions for scripts or books. However, I need to use an AI which already has training in English grammar, sentence structure, and language/information flow, not necessarily interpretation of meaning or philosophy or other higher levels of language proficiency. Say I wanted to write a novel about sailors. I'd let it skim some 100 novels on sailing and generate inputs. Then I'd match it up against a separate set of AIs to fact-check each other, or even with forum posts from some sailing subreddits or news feeds. I could do this with any topic: medicine, engineering, biology, drama novels, etc. Are there any free/open-source bots that already know the English language and might read books and make suggestions based on guided inputs? submitted by /u/International__  ( 3 min )
  • Open

    Google Colab for Machine Learning Projects
    Have you ever wanted an easy-to-configure interactive environment to run your machine learning code that came with access to GPUs for free? Google Colab is the answer you've been looking for. It is a convenient and easy-to-use way to run Jupyter notebooks on the cloud, and their free version comes with some limited […] The post Google Colab for Machine Learning Projects appeared first on Machine Learning Mastery.  ( 13 min )
  • Open

    Machine learning, harnessed to extreme computing, aids fusion energy development
    Linking techniques from machine learning with advanced numerical simulations, MIT researchers take an important step in state-of-the-art predictions for fusion plasmas.  ( 6 min )
  • Open

    MoLeR: Creating a path to more efficient drug design
    Drug discovery has come a long way from its roots in serendipity. It is now an increasingly rational process, in which one important phase, called lead optimization, is the stepwise search for promising drug candidate compounds in the lab. In this phase, expert medicinal chemists work to improve “hit” molecules—compounds that demonstrate some promising properties, […] The post MoLeR: Creating a path to more efficient drug design appeared first on Microsoft Research.  ( 6 min )
  • Open

    Answers Blowin’ in the Wind: HPC Code Gives Renewable Energy a Lift
    A hundred and forty turbines in the North Sea — and some GPUs in the cloud — pumped wind under the wings of David Standingford and Jamil Appa’s dream. As colleagues at a British aerospace firm, they shared a vision of starting a company to apply their expertise in high performance computing across many industries. The post Answers Blowin’ in the Wind: HPC Code Gives Renewable Energy a Lift appeared first on NVIDIA Blog.  ( 4 min )
    What Is Conversational AI? ZeroShot Bot CEO Jason Mars Explains
    Entrepreneur Jason Mars calls conversation our “first technology.” Before humans invented the wheel, crafted a spear or tamed fire, we mastered the superpower of talking to one another. That makes conversation an incredibly important tool. But if you’ve dealt with the automated chatbots deployed by the customer service arms of just about any big organization… The post What Is Conversational AI? ZeroShot Bot CEO Jason Mars Explains appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    TinyML, an underrated field of Machine Learning
    TinyML is a groundbreaking technology! Possessing a lot of potential, it is sure to grow exponentially in the coming years.  ( 3 min )
2022-05-27T00:57:38.982Z osmosfeed 1.14.4